1.  PURPOSE


1.1  SCOPE 

This report discusses the work performed for the U. S. Army Signal
Electronics Research and Development Laboratory (USAERDL) under Contract
No. DA 36-039-SC-90787 during the period from 1 January 1963 to 31 March 1963.

1.2  OBJECTIVES 

The  objective  of  this  project  is  to  investigate  the  techniques  and 
concepts  of  information  retrieval  and  to  formulate  and  develop  a  general 
theory  of  information  retrieval.  The  formalisation  of  this  theory  is 
oriented to the automation of large-capacity information storage and
retrieval systems. This theoretical framework will be the basis for the
use of general purpose stored-program digital computer systems to perform
the  storage  and  retrieval  functions. 

1.3  PROJECT TASKS

The task structure described in this section is based upon the information
retrieval model specified in the First Quarterly Report to USAERDL,
the framework elaborated for it in the Second Quarterly Report, and
subsequent discussions with USAERDL project personnel. This structure is
intended as an organisational guide for continuing investigations. It is
not intended to exclude constructive effort in task areas that may not
have been foreseen, nor is it likely that all the tasks and subtasks
specified will receive equally intensive treatment.

The goal of this project is a theory or a model of fully automated
information content storage and retrieval systems. The task structure
deals with four areas of procedural capability that must be developed
if this goal is to be achieved:

(a) Input capabilities.

(b) Query capabilities.

(c) Processing capabilities.

(d)  Information  retrieval  system  theory  and  integration  (integrative 
capabilities). 

The first three areas are roughly analogous to the D, E, and P transforms
of the basic information retrieval model. The last area is a
supra-ordinate category that indirectly involves the other three. Each of
these areas will be briefly considered as tasks, salient subtasks will be
described, and the interrelationships between various tasks and subtasks
will be pointed out.

1.3.1  Input Capability - In the ultimate system, written, printed,
or oral material in natural language should be accepted as input for
automatic processing and analysis at the morphological, syntactic,
semantic, logical, and factual levels. As a consequence of such input
processing, all explicit and implicit or factual reference of the input
material should be appropriately displayed or elucidated for further
processing in response to queries.

In large measure, most of these potential capabilities are outside
the scope of this project. Visual or auditory pattern recognition
devices for reading or listening to natural language are an ancillary
problem that may be left for separate development. Linguistic analysis
has been eliminated as a primary focus of the project, and an attempt to
achieve mechanical understanding of text is thus also beyond the scope
of this research activity.

Whether read-in and linguistic analysis are completely automated or
not, a central problem in the transformation of information inputs into
forms useable in storage and retrieval is the classifying, categorising,
or indexing process. Such a classification stage is essential regardless
of the degree of sophistication or automation in an information
storage and retrieval system.

Capabilities in this area are currently quite limited. To date,
operational classificatory schemes tend to be intuitively formulated and
manually implemented. Furthermore, there are no systematic procedures for
improving the precision of categories to assure that the denotations used
by the system properly correspond to the denotations understood or desired
by the user. Accordingly there are three major subtasks:

(a) The development of explicit procedures for establishing useful
category groupings and boundaries.

(b) The development of procedures for automatically assigning items
to classificatory categories.

(c) The development of methods for improving the precision of
category denotation between the system and the user.

Before considering each subtask in further detail, it should be noted
that they need not be completely independent. Ultimately, these
capabilities cannot be fully developed without reference to other system
capabilities, i.e., query and processing. Furthermore, subtasks (a) and (b)
may merge into a single theoretical and procedural statistical scheme for
both selecting categories and assigning items to them. Such interdependence
in dynamic systems does not, however, preclude an essentially parallel
attack on separate aspects of the capabilities problem.

1.3.1.1  Development of Explicit Procedures for Establishing
Useful Category Groupings and Boundaries - There is already a mathematical
literature on the problem of category formation based upon measures
of  relevance  between  the  units  to  be  grouped.  This  literature  will  not 
be  reviewed  here,  such  a  review  being  a  preliminary  phase  of  work  on  each 
of  the  tasks  and  subtasks.  Some  of  the  relevant  work  involves  factor 
analysis,  latent  class  analysis,  and  the  theory  of  clumps. 

There are two kinds of applications for such explicit procedures
in establishing categories:

(a) For grouping or indexing documents (i.e., items received by the
information system) into larger categories. This application
is essentially the same function that intuitive library
classification schemes currently serve.

(b) For finding salient boundaries within documents or items to
analyse them into smaller useable parts. As the number of
parts increases and the articulation of their interrelationships
increases in sophistication, the goal of input analysis
is approached in evolutionary stages.

1.3.1.2  Development of Procedures for Automatically Assigning
Items to Classificatory Categories - The purpose of this subtask is
twofold: first, to eliminate subjectivity in the classification of library
items and thus to increase precision; second, to alleviate the time and
cost required for manual classification. It is irrelevant to these
purposes whether the classificatory scheme is systematically developed as
membership in exclusive categories or whether a traditional scheme is to
be implemented automatically. There may, however, be differences in the
complexity and difficulty of the automatic classification problem based
upon the type of classificatory schemes used. Specifically, the question
of independence among classificatory categories and type of class
membership may affect the nature of the automatic classificatory procedures.

Two  kinds  of  problems  can  be  distinguished  as  follows: 

(a) Membership in exclusive categories. This situation exists when
categories are exclusive. An item can be assigned to only one
category and not assigned to the remainder. Clue word schemes
developed to date, including the approach reported in the Second
Quarterly Report, are essentially limited to such classification.
This type of classification exists in traditional hierarchical
schemes such as the Dewey or Library of Congress systems. Such
hierarchies, if well conceived, have the advantage that cruder
discriminatory or predictive techniques can be applied to higher
level distinctions until more precise methods are available for
dealing with lower level distinctions.

(b) Degree of inclusion and/or nonexclusive categories. This problem
is a more general case in which an item may be not only assigned
to a given category, but also assigned to some degree as relevant
to a category, and/or assigned to more than one category
simultaneously. The systems included under exclusive categories
are a special case of nonexclusive categories, and the general
case will require more sophisticated treatment. While operational
systems do not yet extensively use category assignment
by degree of relevance, newer Uniterm or coordinate indices
already use multiple category assignment per document. As
increasingly articulated category assignment becomes possible
automatically, the ultimate goal of the project is approached.
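The distinction between the two kinds of assignment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the representation (each item mapped to a dictionary of category degrees in the range 0 to 1), the function names, and the sample items are all hypothetical and do not come from the project reports. It shows exclusive assignment as the special case in which every item has exactly one category at full degree.

```python
from typing import Dict, List

# Hypothetical representation: item -> {category: degree of relevance in [0, 1]}
Assignment = Dict[str, Dict[str, float]]

def is_exclusive(assignment: Assignment) -> bool:
    """True when every item has exactly one category, assigned at full degree."""
    return all(
        len(cats) == 1 and all(degree == 1.0 for degree in cats.values())
        for cats in assignment.values()
    )

def items_in(assignment: Assignment, category: str, min_degree: float = 0.0) -> List[str]:
    """Items assigned to a category with at least the given degree."""
    return sorted(
        item for item, cats in assignment.items()
        if category in cats and cats[category] >= min_degree
    )

# Exclusive scheme (as in a traditional hierarchy): one category per item.
exclusive = {"doc1": {"physics": 1.0}, "doc2": {"chemistry": 1.0}}

# Nonexclusive scheme with degrees: doc1 is relevant to two categories.
graded = {"doc1": {"physics": 0.8, "chemistry": 0.3}, "doc2": {"chemistry": 1.0}}

print(is_exclusive(exclusive), is_exclusive(graded))
print(items_in(graded, "chemistry"), items_in(graded, "chemistry", 0.5))
```

Because the exclusive scheme fits the same data structure, any procedure written for the graded case also handles the exclusive one, which mirrors the remark that exclusive categories are a special case of nonexclusive categories.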

1.3.1.3  Development of Methods for Improving the Precision of
Category Denotation Between the System and the User - Assuming that
categories and item assignment have somehow been arrived at, whether
intuitively and manually or explicitly and automatically, the system cannot
function optimally unless category denotation agrees with usage. Since
it  is  unlikely  either  that  the  system's  denotations  will  agree  perfectly 
with  those  of  the  average  user  or  that  the  denotations  of  the  users  will 
agree  perfectly,  there  are  two  kinds  of  problems  that  can  currently  be 
isolated  within  this  subtask: 

(a)  Corrective  procedures.  These  procedures  refer  to  the  applica¬ 
tion  of  user  feedback,  along  with  assisted  invariances  between 
the  user  and  system  denotations,  to  adjust  the  assigned  item 
content  in  the  system's  categories.  A  fuller  account  of  an 
approach  to  this  problem  is  included  in  the  Second  Quarterly 
Report.

(b) Non-Boolean retrieval. This function refers to the problem of
using  criteria  for  averaging  or  optimising  category  membership 
under  conditions  of  user  disagreement.  It  is  not  generally  the 
case  that  an  optimization  criterion  for  category  membership — 
e.g.,  50  percent  user  agreement  on  an  item  places  it  within  the 
category— will  be  fulfilled  for  Boolean  functions  of  individually 
optimised  categories.  That  is,  the  union  of  two  50-percent 
agreement  categories  may  not  contain  only  those  documents  on 
which  there  was  50-percent  agreement  that  they  belong  in  both 
categories. Hence, non-Boolean retrieval functions are needed
to  resolve  this  problem. 
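The difficulty with Boolean functions of individually optimised categories can be made concrete with a small sketch. The data below are invented for illustration: four hypothetical users each state which of two categories, A and B, they would place a document in, and a 50-percent agreement criterion is applied. The function names and the threshold constant are assumptions of this sketch, not part of any scheme described in the reports.

```python
THRESHOLD = 0.5  # assumed agreement criterion (50 percent of users)

def agreement(votes, predicate):
    """Fraction of users whose judgment satisfies the predicate."""
    return sum(1 for v in votes if predicate(v)) / len(votes)

def in_category(votes, cat):
    """Membership in one category, decided by the agreement criterion alone."""
    return agreement(votes, lambda v: cat in v) >= THRESHOLD

def boolean_and(votes):
    """Boolean AND of the two individually thresholded categories."""
    return in_category(votes, "A") and in_category(votes, "B")

def direct_and(votes):
    """Criterion applied directly to 'belongs in both A and B'."""
    return agreement(votes, lambda v: "A" in v and "B" in v) >= THRESHOLD

def boolean_or(votes):
    """Boolean OR of the two individually thresholded categories."""
    return in_category(votes, "A") or in_category(votes, "B")

def direct_or(votes):
    """Criterion applied directly to 'belongs in A or B'."""
    return agreement(votes, lambda v: "A" in v or "B" in v) >= THRESHOLD

# Each entry is one user's judgment of the same document.
doc1 = [{"A"}, {"A"}, {"B"}, {"B"}]   # half say A, the other half say B
doc2 = [{"A"}, {"B"}, set(), set()]   # one says A, one says B, two say neither

print(boolean_and(doc1), direct_and(doc1))
print(boolean_or(doc2), direct_or(doc2))
```

For doc1 the Boolean combination admits the document although no user places it in both categories; for doc2 the Boolean combination rejects a document that half the users place in at least one of the categories. Either mismatch is the kind of case a non-Boolean retrieval function would have to resolve.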

1.3.2  Query  Capabilities  -  Many  of  the  general  considerations  regard¬ 
ing  pattern  recognition,  linguistic  analysis,  and  problem  interrelations 
discussed  under  input  capabilities  are  also  relevant  as  functional  aspects 
of queries. The situation is so similar, however, that a repetition of
this  discussion  in  the  query  capability  context  is  unnecessary.  As  in 
the  case  of  input  capabilities,  the  query  problem  will  be  attacked  from 
the  viewpoint  of  relaxing  the  limitations  of  current  information  storage 
and  retrieval  systems. 

In  most  operational  systems  the  possible  query  is  essentially: 

"What  documents  in  the  system  contain  information  of  the 
following kind: ________?"

There  are  at  least  three  limitations  on  this  form  of  query  that  require 
resolution  before  more  sophisticated  information  storage  and  retrieval 
systems are possible:

(a) Limitation to documents.

(b)  Limitation  to  unrestricted  retrieval  of  all  items. 

(c)  Limitation  on  description  of  type  of  information  desired. 

Each  of  these  limitations  will  be  considered  as  subtasks. 

1.3.2.1  Limitation to Documents - The query capability should
be extended so that a system may respond with appropriate portions of
documents rather than documents as a whole. The input capability described
under explicit techniques for salient boundaries is essential for satisfy¬
ing  this  query  capability.  Another  approach  might  involve  the  extension 
of  work  already  done  in  automatic  abstracting  or  extracting,  which  selects 
salient  information  from  documents  rather  than  merely  salient  portions  of 
documents. 


Such  a  capability  cannot  be  provided  in  a  vacuum.  Input  capa¬ 
bilities  must  provide  indices  to  document  parts  as  well  as  isolate  them. 
Processing  capabilities  must  provide  means  of  associating  such  document 
parts  with  the  query.  These  considerations  apply  equally  to  the  remain¬ 
ing  subtasks  considered  under  query  capabilities. 

1.3.2.2  Limitation to Unrestricted Retrieval of All Items -
The purpose of this subtask is essentially the same as that for the
preceding one, viz., to reduce necessary search activity on the part of
the user by performing it within the system. Only in specialised
scholarly situations does the user need all documents that are potentially
relevant to his query. There are two problem areas suggested for this
subtask:

(a) Elimination of (low quality) redundancy. In many fields there
is a proliferation of documents covering the same topics. Many
of these documents may also be low in quality. It is desirable
to increase the sophistication of indexing, an input capability
essentially, so that the contents of an item, even if it is
only part of a document, are described or classified not only
according to the topics to which they are relevant but also
according to the degree of uniqueness with which the topics are
dealt with. It seems that such indexing could not be readily
achieved using purely statistical means, and this capability may
be one of the most difficult to automate.

(b) Specification of scope. In addition to weeding out redundant
or low quality materials, it would be desirable to be able to
restrict the scope of retrieval on a given query according to
the needs of the user. This function obviously would involve
considerations of relevance and its measurement as well as
integration with the mode in which desired information is
characterised. The latter requirement is also considered in the
following subtask.
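A minimal sketch of scope restriction follows, assuming that some numerical measure of relevance between a query and an item is available. The measure used here (overlap of descriptor sets, i.e. the Jaccard coefficient) is only a stand-in chosen for brevity; it is not the relevance measure of the First Quarterly Report. The parameter names `min_relevance` and `max_items`, and the sample index, are likewise hypothetical.

```python
def jaccard(a, b):
    """Stand-in relevance measure: overlap between two descriptor sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def retrieve_in_scope(index, query, min_relevance=0.0, max_items=None):
    """Rank items by relevance, then restrict the scope of the response.

    min_relevance drops items below a relevance floor;
    max_items caps how many items are returned (None means no cap).
    """
    scored = [(jaccard(descriptors, query), doc) for doc, descriptors in index.items()]
    kept = [(score, doc) for score, doc in scored if score >= min_relevance]
    ranked = sorted(kept, key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in ranked][:max_items]

index = {
    "doc1": {"radar", "antenna"},
    "doc2": {"radar", "computer", "storage"},
    "doc3": {"computer"},
}
query = {"radar", "antenna"}

print(retrieve_in_scope(index, query))                     # unrestricted
print(retrieve_in_scope(index, query, min_relevance=0.2))  # relevance floor
print(retrieve_in_scope(index, query, max_items=1))        # best item only
```

The two parameters correspond to two ways a user might state scope: "nothing less relevant than this" and "no more than this many items". Either depends on the relevance measure being meaningful, which is why scope specification cannot be separated from relevance measurement.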

1.3.2.3  Limitation on Description of Type of Information Desired -
Different operational information systems impose different limitations of
this type. A hierarchically organised index or query language may produce
such unusual classifications of new material that a subsidiary index is
necessary in order to use the primary index properly. Freer Uniterm
systems are limited to Boolean functions of two-valued descriptors; the
descriptor is either present or absent. The use of role indicators and
similar devices offers some possibility of improving the query. But the
crux  of  the  problem  is  to  develop  a  query  capability  that  allows  a  user 
to  state  his  question  precisely.  This  ability  is  essential  to  useful 
content  retrieval. 
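The restriction to Boolean functions of two-valued descriptors can be illustrated with a short sketch. The index contents and the helper names (`has`, `AND`, `OR`, `NOT`, `retrieve`) are hypothetical and serve only to show the form of query such a system admits: a descriptor is either present on an item or absent, and a query is a Boolean combination of such tests.

```python
def has(term):
    """Two-valued descriptor test: the term is either present or absent."""
    return lambda descriptors: term in descriptors

def AND(p, q):
    return lambda descriptors: p(descriptors) and q(descriptors)

def OR(p, q):
    return lambda descriptors: p(descriptors) or q(descriptors)

def NOT(p):
    return lambda descriptors: not p(descriptors)

def retrieve(index, query):
    """Return every item whose descriptor set satisfies the Boolean query."""
    return sorted(doc for doc, descriptors in index.items() if query(descriptors))

# Hypothetical coordinate index: item -> set of descriptors.
index = {
    "doc1": {"radar", "antenna"},
    "doc2": {"radar", "computer"},
    "doc3": {"computer", "storage"},
}

print(retrieve(index, AND(has("radar"), NOT(has("computer")))))
print(retrieve(index, OR(has("antenna"), has("storage"))))
```

Nothing in such a query can express how strongly a descriptor applies or what role it plays in the item, which is the limitation the paragraph above describes.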

It  should  be  noted  that  the  problem  of  designing  an  adequate 
query  or  descriptor  language  for  the  purposes  of  the  user  has  an  analog 
in  the  design  of  an  adequate  representation  of  this  language  for  machine 
processing.  The  design  of  the  query  language  must,  therefore,  take  into 
consideration problems of machine representation and processing as well.

1.3.3  Processing  Capabilities  -  Advances  in  information  storage  and 
retrieval  depend  upon  improved  processing  algorithms.  Advances  in  the 
other  capabilities  will  influence  the  choice  of  processing  techniques. 

It  is,  consequently,  difficult  to  define  relatively  independent  problems 
in  advance.  In  the  present  state  of  development,  the  processing  task 
can  be  subdivided  into  two  major  subtasks: 

(a)  Associative  techniques. 

(b)  Organisation  and  search. 

1.3.3.1  Associative Techniques - In order to respond to queries
with appropriately indexed documents, an information system must have
techniques for associating the two. In simple systems queries and index
categories are so limited in differentiation that the association problem
may become trivial. As greater flexibility is introduced in the query
language and as input capabilities are improved, supplying appropriate
information requires associative capabilities.

One aspect of this problem, the measurement of relevance, has
already been considered in the First Quarterly Report. Such measures
are relevant both to input and query capabilities as well as to
associative processing in response to queries. Further work is required on
the development of associative techniques using such measures of relevance.

1.3.3.2  Organisation and Search - There is a sense in which
file organisation may differ from search theory and procedures. At a
system design level, however, these considerations become inseparable.
Thus, while file organisation may be abstractly distinguished from the
procedures used to search a file, in practice the theoretical work in
one area depends upon extensive explicit or implicit assumptions about
the other. Accordingly, organisation and search are treated in a single
subtask.


Both  organisation  and  search,  however,  can  be  conveniently 
divided  into  two  aspects,  logic  and  efficiency. 

(a) Logical aspects. The logical aspects refer to organisation or
search procedures based upon logical relations that are inherent
in the subject matter and the system and are essential to
performing the processing. Examples of logical organisation are
alphabetization, hierarchies, or matrices.

(b) Efficiency. Superimposed on a given logical organization are
considerations of efficiency. These problems are most influenced
by the relative activity of different portions of the system,
the nature of the information in the system, and the physical
nature of the system. Efficiency considerations lead to
rearrangements within a given logical organization for performing
a system task at a minimum cost.
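The effect of relative activity on efficiency can be sketched with a deliberately simple cost model. The assumptions here are not taken from the report: the file is scanned sequentially from the front, the cost of a lookup is the number of entries examined, and the access probabilities are invented. Under those assumptions, the same set of entries costs less on average when the more active entries are placed first.

```python
def expected_comparisons(order, access_prob):
    """Expected number of entries examined in a front-to-back sequential scan."""
    return sum(access_prob[key] * (position + 1)
               for position, key in enumerate(order))

# Hypothetical relative activity of three index entries (sums to 1).
access_prob = {"antenna": 0.1, "computer": 0.6, "radar": 0.3}

# Logical organisation: alphabetical order of the entries.
alphabetical = sorted(access_prob)

# Efficiency rearrangement: most active entries first.
by_activity = sorted(access_prob, key=access_prob.get, reverse=True)

print(alphabetical, expected_comparisons(alphabetical, access_prob))
print(by_activity, expected_comparisons(by_activity, access_prob))
```

With these figures the alphabetical arrangement costs about 2.2 comparisons per lookup and the activity-ordered arrangement about 1.5. A real system would weigh such gains against the physical characteristics of the store and the cost of maintaining the rearranged order.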

1.3.4  Information Retrieval System Theory and Integration: Integration
Capabilities - This task did not appear as a separate unit in the
Second Quarterly Report. At that time it appeared that it could be handled
under processing. That report noted that some of the tasks included under
processing were of a supra-ordinate nature, referring to the capabilities
of information systems as a whole rather than to specific input, query,
or processing capabilities. After reviewing the framework in that report,
it was decided to consider these factors as a separate task.

There are three subtasks in this area:

(a)  Measures  and  models  of  system  value  and  efficiency. 

(b)  Models  and  methods  for  system  integration  and  optimisation. 

(c)  General  theoretical  considerations. 

It  is  apparent  from  this  simple  enumeration  that  while  such  considera¬ 
tions  must  permeate  work  in  the  other  three  areas  of  capability,  a  sep¬ 
arate  treatment  is  warranted  in  a  project  aimed  at  the  development  of  a 
general  information  system  theory  or  design  methodology.  Each  of  the 
subtasks  will  now  be  briefly  considered. 

1.3.4.1  Measures and Models of System Value and Efficiency -
This subtask is addressed to the development of a capability to answer
such questions as:

Do  we  need  a  new  information  system? 

If  so,  what  is  its  value? 

What  kind  of  system  would  most  efficiently  serve  our  needs? 

Value  and  efficiency  do  not  refer  merely  to  the  cost  and  specifications 
of  individual  pieces  of  hardware.  Such  engineering  problems  must  ultimately 
be  resolved  in  the  design  of  any  given  system,  but  detailed  consideration 
of  these  factors  is  outside  the  scope  of  this  project.  Value  and  efficiency 
thus  refer  to  theoretical  measures  and  models  of  the  necessity  and  adequacy 
of  the  system  as  a  whole. 

1.3.4.2  Models and Methods for System Integration and Optimization -
This subtask will deal with the problem of integrating specific
configurations. Work on this subtask is to some extent dependent upon
value and efficiency models, but the focus is upon theoretical methods
rather than specific engineering considerations.

1.3.4.3  General Theoretical Considerations - This subtask is
included  to  allow  for  work  on  the  development  of  ideas  that  may  emerge 
on  the  nature  of  information  storage  and  retrieval.  It  constitutes  an 
admission  that  the  task  and  subtask  structure  may  not  yet  contain  the 
germinal  or  organising  principles  for  a  general  theory  of  information 
retrieval. 


1.3.5  Summary - A task framework has been described in terms of
areas of capability that require development in order to evolve fully
automatic, factual content, information storage and retrieval systems.
An outline of the task framework follows.

(a) Input capabilities.

    (1) Explicit procedures for establishing useful category
        groupings and boundaries.

        a. Larger groupings.

        b. Internal boundaries.

    (2) Procedures for automatically assigning items to
        classificatory categories.

        a. Exclusive categories.

        b. Non-exclusive categories.

    (3) Methods for improving the precision of category denotation
        between system and user.

        a. Corrective procedures.

        b. Non-Boolean retrieval.

(b) Query capabilities.

    (1) Relax limitation to documents.

        a. Portions of documents.

        b. Abstracts or extracts.

    (2) Restricted retrieval.

        a. Elimination of redundancy.

        b. Specification of scope.

    (3) Relax limitations on description.

(c) Processing capabilities.

    (1) Associative techniques.

    (2) Organisation and search.

        a. Logic.

        b. Efficiency.

(d) Integration capabilities.

    (1) Measures and models of system value and efficiency.

    (2) Models and methods for system integration and optimisation.

    (3) General theoretical considerations.

An attempt to classify both work planned and accomplished, as well as
literature reviews, will continue in terms of the task framework presented
in this section. Such a process will allow the framework to be articulated
or revised as it is tested in practice.

2.  ABSTRACT


Work  in  each  of  the  four  areas  of  capability  isolated  in  the  project 
task  structure  has  been  performed  in  the  past  quarter.  Under  input  capa¬ 
bilities  an  extension  of  last  quarter's  work  on  procedures  for  automatic 
assignment has been accomplished and the development of a probabilistic
non-Boolean retrieval model has been initiated. Under query capabilities
new approaches to the problems of limitation to documents, automatic
extracting or abstracting, and restricted retrieval (elimination of
redundancy) have been developed.

The work on non-Boolean retrieval is also relevant to the query
capabilities subtasks concerned with the specification of scope and the
relaxation of limitations on descriptions. Under processing capabilities
there is no new progress to be reported on associative procedures, but
extensive mathematical analysis has been initiated on the problem of file
organisation and search. Finally, under integrating capabilities some general
theoretical considerations have evolved that should lead to measures and
models of system value and optimisation in item retrieval systems.

3.  PUBLICATIONS, REPORTS, AND CONFERENCES

3.1  TECHNICAL  NOTES 

The  following  internal  technical  memoranda  were  issued  during  this 
reporting  period: 

(a) IEC TECHNICAL NOTE, File No. P-AA-TN-(0050)-N, 18 February 1963;
Task Framework for Continuation of Information Retrieval Research,
George Greenberg, Quentin A. Darmstadt, Alexander Szejman, and
Alfred Trachtenberg.

(b) IEC TECHNICAL NOTE, File No. P-AA-TN-(0051)-N, 25 February 1963;
Analysis of File Organisations for Information Retrieval,
Quentin A. Darmstadt.

(c) IEC TECHNICAL NOTE, File No. P-AA-TN-(0058)-N, 19 March 1963;
An Approach to a Criterion for Automatic Extracts, George Greenberg
and Alexander Szejman.

(d) IEC TECHNICAL NOTE, File No. P-AA-TN-(0064)-N, 25 March 1963;
Non-Boolean Retrieval Processes, Alexander Szejman.

(e) IEC TECHNICAL NOTE, File No. P-AA-TN-(0069)-N, 25 March 1963;
The Problem of Redundancy in the Information Retrieval Systems,
Alexander Szejman.

(f) IEC TECHNICAL NOTE, File No. P-AA-TN-(0070)-N, 25 March 1963;
Information Theoretical Methods of Document Categorisation Using
Word Frequency Information, Alfred Trachtenberg.

These technical notes are dated at the time of their completion; these
dates do not necessarily correspond to the date of publication.

3.2  REPORTS 

The  following  reports  were  issued  during  this  reporting  period: 

(a) RESEARCH IN INFORMATION RETRIEVAL: Second Quarterly Report,
1 October 1962 - 31 December 1962, Technical Report P-AA-TR-(0031),
(Manuscript Version), 31 January 1963.

(b) MONTHLY LETTER REPORT NO. 5, 1 January 1963 - 31 January 1963,
File No. P-AA-TR-(0032), 31 January 1963; Research in Information
Retrieval, Alfred Trachtenberg.

(c) MONTHLY LETTER REPORT NO. 6, 1 February 1963 - 28 February 1963,
File No. P-AA-TR-(0033), 28 February 1963; Research in Information
Retrieval, Alfred Trachtenberg.

3.3  CONFERENCES 

The following conferences were held between IEC personnel and the
USAERDL:

(a) 28 February 1963: Meeting at IEC. IEC personnel met with
Mr. Anthony V. Campi, who had recently been assigned as Project
Engineer. Several aspects of the Second Quarterly Report were
discussed. Several minor corrections and elaborations were
requested. A general emphasis on the importance of user
requirements was indicated.

4.  FACTUAL DATA

4.1  ORGANIZATION

This  section  is  organized  according  to  the  four  major  areas  of  required 
capability isolated in the project task structure (see Section 1.3).

4.2  INPUT  CAPABILITIES 

Work  performed  under  input  capabilities  includes  both  an  extension 
of  the  last  quarter's  work  on  information  theoretic  methods  of  document 
categorisation  using  word  frequency  information  and  the  development  of 
a  scheme  for  non-Boolean  retrieval.  Thus  work  has  proceeded  on  Sections 
(a)(2)  and  (a)(3)  of  the  task  framework.  Work  on  (a)(2),  however,  is 
also  relevant  to  (a)(1).  Furthermore,  the  work  on  non-Boolean  retrieval 
has  implications  that  are  more  general  than  input  capabilities. 

4.2.1  Information  Theoretic  Methods  of  Document  Categorization  Using 
Word  Frequency  Information 

4.2.1.1  Introduction - In the last quarterly report, some
information theoretical methods of document classification were presented.
These methods used word occurrences as clues to the classification of a
document. The number of times a word occurred in a document was not
considered at that time; only the fact of its occurrence in a document was
used to predict document categories. Thus all the information provided
by word frequency information was neglected. This extension considers
how such information can be used to provide a better prediction of
categories of documents.



It  was  assumed  that  initially  a  group  of  human  experts  would 
classify  a  number  of  documents  into  a  given  set  of  categories,  and  that 
this  initially  classified  group  was  large  enough  accurately  to  reflect 
the  statistics  of  the  larger  body  of  documents  that  would  later  be  auto¬ 
matically  classified.  Thus  the  probabilities  of  categorization  of  the 
larger  group  of  documents  were  approximated  by  the  relative  frequencies 
of  categorization  of  the  initial  group  of  documents. 


The  criteria  used  for  selecting  a  particular  word  to  predict 
categories  were: 

(a)  That  its  occurrence  in  documents  be  strongly  correlated  with 
the  appearance  of  those  documents  in  a  particular  category,  for 
the  group  of  documents  that  would  be  initially  classified. 

(b) That the word supply more information than the a priori
distribution of documents in categories did; i.e., that the
distribution of documents containing this word differ markedly from
the distribution of all the documents.


These criteria were expressed mathematically by the expressions:

\[
H_i = -\sum_j p_{ij} \log p_{ij}
\]

and

\[
M_i = \sum_j p_{ij} \log \frac{p_{ij}}{p_j} \tag{4-1}
\]

where $p_j$ is the probability that a document falls into category $C_j$, and
$p_{ij}$ is the probability that a document containing word $W_i$ falls into
category $C_j$. Thus a good predictor would have a low $H_i$ and a high $M_i$.


4.2.1.2  Extension of Concepts to Include Word Frequency
Information - There are several ways in which word frequency information can
be taken into account to determine good predictors of document categories.
The first two methods use absolute values of word occurrence in a document,
while the third method uses relative word frequency in a document to
obtain more information.


Let:

    $N$         = the total number of documents in the initial group.

    $N_i$       = the number of documents in which word $W_i$ occurs.

    $N_i(x)$    = the number of documents in which word $W_i$ occurs x times.

    $n_j$       = the number of documents in category $C_j$.

    $n_{ij}$    = the number of documents in category $C_j$ which have word
                  $W_i$.

    $n_{ij}(x)$ = the number of documents in category $C_j$ which have word
                  $W_i$ x times.

Now:

    $$ N_i = \sum_x N_i(x) $$

    $$ n_{ij} = \sum_x n_{ij}(x) \tag{4-2} $$


In addition to the probabilities $p_{ij}$ and $p_j$, the following
probabilities can be defined. Let:

    $p_i$            = the probability that a document contains word $W_i$.

    $p_i(x)$         = the probability that a document contains word $W_i$
                       x times.

    $p_{ij}(x)$      = the probability that a document containing word
                       $W_i$ x times falls into category $C_j$.

    $p(C_j, W_i)$    = the joint probability that a document is in
                       category $C_j$ and contains word $W_i$.

    $p[C_j, W_i(x)]$ = the joint probability that a document is in
                       category $C_j$ and contains word $W_i$ x times.

Then the probabilities can be approximated as follows:

    $$ p_i = \frac{N_i}{N} $$

    $$ p_i(x) = \frac{N_i(x)}{N} $$

    $$ p_{ij}(x) = \frac{n_{ij}(x)}{N_i(x)} $$

    $$ p(C_j, W_i) = \frac{n_{ij}}{N} $$

    $$ p[C_j, W_i(x)] = \frac{n_{ij}(x)}{N} \tag{4-3} $$

Of course:

    $$ p_i = \sum_x p_i(x) $$

    $$ p(C_j, W_i) = \sum_x p[C_j, W_i(x)] \tag{4-4} $$

and $p_{ij}(x)$ is related to $p_{ij}$ by the expression:

    $$ p_{ij} = \frac{\sum_x p_{ij}(x) N_i(x)}{\sum_x N_i(x)} \tag{4-5} $$


(a) Method 1 - The measures $H_i$ and $M_i$ can easily be generalized to
include frequency information by considering word $W_i$ occurring
x and only x times in a document as a clue. Then, instead of
using $p_{ij}$ in $H_i$ and $M_i$, a new probability $p_{ij}(x)$ can be used.

Two new measures, $H_i(x)$ and $M_i(x)$, can now be defined:

    $$ H_i(x) = -\sum_j p_{ij}(x) \log p_{ij}(x) $$

    $$ M_i(x) = \sum_j p_{ij}(x) \log \frac{p_{ij}(x)}{p_j} \tag{4-6} $$

With these measures, the effectiveness of word $W_i$ as a predictor,
when it occurs x times in a document, can be evaluated. As
before, $H_i(x)$ must be low and $M_i(x)$ must be high for a good
predictor.


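The average effectiveness of a word as a predictor can be
measured by:

    $$ \bar{H}_i(x) = \langle H_i(x) \rangle_x $$

    $$ \bar{M}_i(x) = \langle M_i(x) \rangle_x \tag{4-7} $$

Then, on the basis of Equations 4-3 and 4-4, it follows that:

    $$ \bar{H}_i(x) = \frac{\sum_x p_i(x) H_i(x)}{\sum_x p_i(x)} \tag{4-8} $$

and:

    $$ \bar{H}_i(x) = -\frac{1}{p_i} \sum_x \sum_j p[C_j, W_i(x)] \log p_{ij}(x) \tag{4-9} $$

Similarly:

    $$ \bar{M}_i(x) = \frac{\sum_x p_i(x) M_i(x)}{\sum_x p_i(x)} \tag{4-10} $$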



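But:

    $$ M_i(x) + H_i(x) = -\sum_j p_{ij}(x) \log p_j \tag{4-11} $$

therefore:

    $$ \langle M_i(x) + H_i(x) \rangle_x = \bar{M}_i(x) + \bar{H}_i(x) $$

    $$ = -\frac{1}{p_i} \sum_x \sum_j p[C_j, W_i(x)] \log p_j $$

    $$ = -\frac{1}{p_i} \sum_j p(C_j, W_i) \log p_j \tag{4-12} $$

and, by substituting Equation 4-3:

    $$ \bar{M}_i(x) + \bar{H}_i(x) = -\sum_j p_{ij} \log p_j \tag{4-13} $$

But:

    $$ M_i + H_i = -\sum_j p_{ij} \log p_j \tag{4-14} $$

therefore:

    $$ \bar{M}_i(x) + \bar{H}_i(x) = M_i + H_i \tag{4-15} $$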
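The identity of Equation 4-15 can be checked numerically. The sketch below uses hypothetical counts n_ij(x) and assumes base-2 logarithms; the averages over x are weighted by p_i(x)/p_i = N_i(x)/N_i, as in Equations 4-8 and 4-10.

```python
import math

def measures(p_cond, p_prior):
    """H and M for one conditional distribution over categories."""
    h = -sum(p * math.log2(p) for p in p_cond if p > 0)
    m = sum(p * math.log2(p / q) for p, q in zip(p_cond, p_prior) if p > 0)
    return h, m

# n_ij_x[x][j]: hypothetical number of documents in category j
# that contain the word exactly x times.
n_ij_x = {1: [6, 4], 2: [8, 2], 3: [9, 1]}
p_j = [0.5, 0.5]                      # assumed a priori category probabilities

N_i_x = {x: sum(row) for x, row in n_ij_x.items()}   # N_i(x)
N_i = sum(N_i_x.values())                            # N_i = sum over x of N_i(x)

# Averages over x of H_i(x) and M_i(x), weighted by N_i(x) / N_i.
H_bar = M_bar = 0.0
for x, row in n_ij_x.items():
    p_ij_x = [n / N_i_x[x] for n in row]
    h, m = measures(p_ij_x, p_j)
    weight = N_i_x[x] / N_i
    H_bar += weight * h
    M_bar += weight * m

# Frequency-blind measures H_i and M_i from the pooled counts n_ij.
n_ij = [sum(row[j] for row in n_ij_x.values()) for j in range(len(p_j))]
p_ij = [n / N_i for n in n_ij]
H_i, M_i = measures(p_ij, p_j)

print(H_bar + M_bar, H_i + M_i)   # the two sums agree
```

In this example the averaged entropy is no larger than H_i, while the sum of the two measures is unchanged, which is consistent with the derivation above.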


(b) Method 2 - This method is similar to Method 1. Instead of
considering that a word occurs exactly x times in a document, this
method considers that a word occurs between $x_a$ and $x_b$ times in
a document. In other words, word frequency information is grouped
in intervals of frequency of occurrence, $B_r$. For example, the
frequency intervals might be 1-5 times, 6-10 times, etc.

New probabilities must be introduced. Let:

    $p_i(B_r)$         = the probability that a document contains word
                         $W_i$ x times, where x is in interval $B_r$.

    $p_{ij}(B_r)$      = the probability that a document containing word
                         $W_i$ x times falls into category $C_j$, where x is
                         in interval $B_r$.

    $p[C_j, W_i(B_r)]$ = the joint probability that a document is in
                         category $C_j$ and contains word $W_i$ x times, where
                         x is in interval $B_r$.


Now the probabilities can be expressed as:

    $$ p_i(B_r) = \sum_{x \in B_r} p_i(x) $$

    $$ p[C_j, W_i(B_r)] = \sum_{x \in B_r} p[C_j, W_i(x)] $$

    $$ p_{ij}(B_r) = \frac{\sum_{x \in B_r} p_{ij}(x) N_i(x)}{\sum_{x \in B_r} N_i(x)}
                   = \frac{\sum_{x \in B_r} p[C_j, W_i(x)]}{\sum_{x \in B_r} p_i(x)} \tag{4-16} $$

Then, following Method 1 and Equation 4-6, expressions may be
written for $H_i(B_r)$ and $M_i(B_r)$:

    $$ H_i(B_r) = -\sum_j p_{ij}(B_r) \log p_{ij}(B_r) $$

    $$ M_i(B_r) = \sum_j p_{ij}(B_r) \log \frac{p_{ij}(B_r)}{p_j} \tag{4-17} $$

$H_i(B_r)$ should be low and $M_i(B_r)$ should be high for a good
predictor.


Another set of functions that measure the effectiveness of word $W_i$
as a predictor, when $W_i$ occurs x times and x is in interval
$B_r$, can be obtained by taking the average values of $H_i(x)$ and
$M_i(x)$ over the interval $B_r$. The average effectiveness is
measured by:

    $$ \bar{H}_i(x,r) = \langle H_i(x) \rangle_{x \in B_r} $$

    $$ \bar{M}_i(x,r) = \langle M_i(x) \rangle_{x \in B_r} \tag{4-18} $$

Then, by using Equation 4-11 as in Method 1:

    $$ \langle H_i(x) + M_i(x) \rangle_{x \in B_r}
       = -\frac{\sum_{x \in B_r} \sum_j p_i(x) p_{ij}(x) \log p_j}{\sum_{x \in B_r} p_i(x)} $$

    $$ = -\frac{1}{p_i(B_r)} \sum_j p[C_j, W_i(B_r)] \log p_j $$

    $$ = -\sum_j p_{ij}(B_r) \log p_j \tag{4-19} $$

But:

    $$ H_i(B_r) + M_i(B_r) = -\sum_j p_{ij}(B_r) \log p_j $$

therefore:

    $$ H_i(B_r) + M_i(B_r) = \langle H_i(x) + M_i(x) \rangle_{x \in B_r} \tag{4-20} $$

    $$ = \bar{H}_i(x,r) + \bar{M}_i(x,r) \tag{4-21} $$

If this quantity $[H_i(B_r) + M_i(B_r)]$ is averaged over all r, then
by the proof outlined for Method 1:

    $$ H_i + M_i = \bar{H}_i(x) + \bar{M}_i(x) $$

    $$ = \langle \bar{H}_i(x,r) \rangle_r + \langle \bar{M}_i(x,r) \rangle_r \tag{4-22} $$

Thus the sum of the averages of the two measures remains constant
and is independent of the size of the intervals of frequency of
occurrence.
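The grouping step of Method 2 can be sketched as follows. The counts and the interval boundaries are hypothetical, and `bucket_counts` is an illustrative helper name; the pooled ratio is the count form of Equation 4-16.

```python
def bucket_counts(n_ij_x, intervals):
    """Sum the counts n_ij(x) over each inclusive interval B_r = (lo, hi)."""
    n_categories = len(next(iter(n_ij_x.values())))
    pooled = []
    for lo, hi in intervals:
        totals = [0] * n_categories
        for x, row in n_ij_x.items():
            if lo <= x <= hi:
                totals = [t + n for t, n in zip(totals, row)]
        pooled.append(totals)
    return pooled

# n_ij_x[x][j]: hypothetical number of documents in category j
# containing the word exactly x times.
n_ij_x = {1: [6, 4], 2: [8, 2], 7: [9, 1], 9: [3, 7]}
intervals = [(1, 5), (6, 10)]          # the "1-5 times, 6-10 times" grouping

buckets = bucket_counts(n_ij_x, intervals)
# p_ij(B_r): pooled category counts divided by the pooled N_i(x).
p_ij_B = [[n / sum(row) for n in row] for row in buckets]
print(buckets, p_ij_B)
```

Each row of `p_ij_B` can then be passed to the same H and M computation used for single frequencies.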

(c) Method 3 - This method considers the number of times a word
appears in a document in relation to the total number of words
in a document as a clue. Using this relative frequency
information as clues should provide even better category prediction
than word occurrence or simple word frequency information.

Let f be the relative frequency of a word in a document; the
relative frequency is the ratio of the number of occurrences of
the word in the document to the total number of words in the
document. Let $f_s$ be an interval of relative frequencies, where
the interval is defined by the limits $f_a$ and $f_b$. Then, $p_i(f_s)$
is simply the probability of word $W_i$ occurring in a document
with a relative frequency in the interval $f_s$, and $p_{ij}(f_s)$ is
the probability that a document falls in category $C_j$, given
that the document contains word $W_i$ with a relative frequency
within the interval $f_s$.

The probabilities $p_i(f_s)$ and $p_{ij}(f_s)$ are approximated by:

    $$ p_i(f_s) = \frac{N_i(f_s)}{N} $$

    $$ p_{ij}(f_s) = \frac{n_{ij}(f_s)}{N_i(f_s)} \tag{4-23} $$

where $N_i(f_s)$ is the number of documents containing word $W_i$ with
a relative frequency within the interval $f_s$, and $n_{ij}(f_s)$ is the
number of documents in category $C_j$ containing word $W_i$ with a
relative frequency within the interval $f_s$.

Following the previous analyses, expressions for $H_i(f_s)$ and
$M_i(f_s)$ can be written:

    $$ H_i(f_s) = -\sum_j p_{ij}(f_s) \log p_{ij}(f_s) $$

    $$ M_i(f_s) = \sum_j p_{ij}(f_s) \log \frac{p_{ij}(f_s)}{p_j} \tag{4-24} $$

By analogy to the proofs developed for Methods 1 and 2,
$\bar{M}_i(f_s) + \bar{H}_i(f_s)$ can be calculated where:

    $$ \bar{H}_i(f_s) = \langle H_i(f_s) \rangle_s $$

    $$ \bar{M}_i(f_s) = \langle M_i(f_s) \rangle_s \tag{4-25} $$

Since, as compared to Equation 4-11:

    $$ M_i(f_s) + H_i(f_s) = -\sum_j p_{ij}(f_s) \log p_j \tag{4-26} $$

then:

    $$ \langle M_i(f_s) + H_i(f_s) \rangle_s = \bar{M}_i(f_s) + \bar{H}_i(f_s) \tag{4-27} $$

Therefore, as before:

    $$ \bar{M}_i(f_s) + \bar{H}_i(f_s) = M_i + H_i \tag{4-28} $$


One of the major experimental problems is the proper selection
of frequency intervals to evaluate. For some areas of the
relative frequency spectrum a small change in interval size might
lead to a large change in effectiveness; for other areas of
the spectrum, however, changing the interval might have a
negligible effect on effectiveness. These intervals will in
general not be uniform over the spectrum and will be different for
each word. Although this selection and evaluation appears
difficult, it will lead to better category prediction.
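The relative-frequency clue of Method 3 can be sketched as below. The document text, the interval edges, and the helper names are hypothetical; the report does not prescribe how the intervals are chosen, and the edges need not be uniform.

```python
from bisect import bisect_right

def relative_frequency(tokens, word):
    """Occurrences of `word` divided by the total number of words."""
    return tokens.count(word) / len(tokens) if tokens else 0.0

def interval_index(f, edges):
    """Index s of the interval [edges[s], edges[s+1]) that contains f.

    The last interval is treated as closed so that f = edges[-1] is included.
    """
    s = bisect_right(edges, f) - 1
    return min(s, len(edges) - 2)

doc = "radar signal radar antenna noise radar".split()
edges = [0.0, 0.1, 0.3, 1.0]          # assumed, non-uniform interval limits

f = relative_frequency(doc, "radar")   # 3 occurrences in 6 words
s = interval_index(f, edges)
print(f, s)
```

Counting documents per category for each index s gives the quantities N_i(f_s) and n_ij(f_s) used in Equation 4-23.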

4.2.1.3 Summary - Three ways of using word frequency information
in documents to predict document categories have been indicated. Based
upon earlier information theoretical concepts of document classification,
this information can be evaluated in terms of its effectiveness as a clue
to document categories. It is likely that the most effective clues would
be found in relative frequency information, that is, the ratio of clue word
occurrence to the total number of document words. Once effective clues were
found, they would be used exactly like the clues discussed in previous
reports.

The measures of effectiveness, $H_i$ and $M_i$, have been generalized;
for each case the sum of the averages of the generalized $H_i$ and $M_i$ was
always equal to $H_i + M_i$. Thus, for the relative frequency case:

    $$ H_i + M_i = \bar{H}_i(f_s) + \bar{M}_i(f_s) $$

which seems to indicate that $H_i$ and $M_i$ produce a good average picture
of word effectiveness.

The  major  difficulty  with  using  word  frequency  information  is 
the  increase  in  computation  required.  In  addition,  where  frequency  inter¬ 
vals  are  used,  the  choice  of  intervals  must  be  carefully  determined. 
However, it is expected that category prediction would be much more
accurate.

4.2.2 Non-Boolean Retrieval Processes

4.2.2.1 Introduction - In many cases Boolean search techniques
are inadequate for retrieving information effectively. The objectives
of this section, therefore, are:

(a) To explicate the concept of non-Boolean retrieval.

(b) To show the usefulness of non-Boolean retrieval processes.

(c) To suggest the particular ways in which non-Boolean retrieval
may be effected.

These  concepts  are  presented  in  this  section,  even  though  they  imply  query 
capabilities,  because  of  their  dependence  upon  precise  categorization. 

Most of the presently operating retrieval systems assume that the
ideal objective of the information search processes consists of retrieving
classes of documents corresponding to the descriptor function specified
in the request. Thus, to every Boolean function of descriptors, there
corresponds an identical function defined upon the set of classes of
documents to which the descriptors are affixed. For example, if the
retrieval request is a · b (where a, b are descriptors, and the dot, ·,
signifies logical and), the class retrieved would be $A \cap B$; that is,
the intersection of classes of documents designated by the descriptors
'a' and 'b' respectively. Yet, the effectiveness of retrieval procedures
based upon this kind of correspondence depends upon an assumption that
is not necessarily valid for all information retrieval systems. The
assumption in question is: The documents fall into categories or classes
unequivocally. In other words, the document belongs to a class of
documents with either the probability 1 or probability 0. This section
proves that the Boolean retrieval process will not be most efficient, in
a certain sense, if the assumption is not true.

4.2.2.2 Inefficiencies in Boolean Retrieval - Before demonstrating
the lack of effectiveness of Boolean retrieval, it would be desirable
to consider situations in which probabilistic class assignment could be
expected.

(a) The Case of Many Users - A situation may occur where the views
of users regarding membership of some documents in a certain
category are divergent. Assume, for example, that there are
100 users, 5 categories, and 10 documents. Each user is asked
to assign each document to one or more categories. Table 1
illustrates a possible set of choices.

                 TABLE 1.  PROBABILISTIC ASSIGNMENT

                                DOCUMENTS
    CATEGORIES     1    2    3    4    5    6    7    8    9   10

        A         65   50   75   80   25    0    0   15   30   45
        B        100   50   35   40   60   25   50   75   25    0
        C         90   80   60    0   20   50   40    0    0   10
        D         35   50   25   30   15   15    0   25   80  100

The numbers at the intersection of rows and columns indicate the
probability, expressed as the percentage of users, of a
document belonging to a certain category. Thus document No. 10
will belong to category D with probability 1, since all the users
agree to place it there. On the other hand, the same document
will have a probability of zero of belonging to category B;
again, all the users agree to exclude it from this category.
Since 45 percent of the users agreed to place document No. 10
in category A, it has been assigned a probability of .45.

(b)  Automatic  Category  Formation  -  Documents  may  be  assigned  to 
categories  in  accordance  with  an  automatic  procedure.  This 
procedure  may  be  intrinsically  probabilistic  in  nature;  that 
is,  a  document  is  assigned  to  a  category  with  probability  p 
depending  upon  the  circumstances  pertaining  to  the  procedure 
of  assignation. 
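The many-users case can be made concrete with a short sketch that turns the vote percentages of Table 1 into membership probabilities. Only the four category rows shown in the table are used; the names `table_1` and `membership` are illustrative.

```python
# Percentage of the 100 users who placed each document (1-10) in each category,
# as read from Table 1.
table_1 = {
    "A": [65, 50, 75, 80, 25, 0, 0, 15, 30, 45],
    "B": [100, 50, 35, 40, 60, 25, 50, 75, 25, 0],
    "C": [90, 80, 60, 0, 20, 50, 40, 0, 0, 10],
    "D": [35, 50, 25, 30, 15, 15, 0, 25, 80, 100],
}
USERS = 100

# Relative frequency of assignment, used as the membership probability.
prob = {cat: [votes / USERS for votes in row] for cat, row in table_1.items()}

def membership(document, category):
    """Probability that `document` (numbered from 1) belongs to `category`."""
    return prob[category][document - 1]

print(membership(10, "D"), membership(10, "B"), membership(10, "A"))
```

Because users may assign a document to more than one category, the probabilities for a single document need not sum to 1 across categories.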

Assume now that there is a collection of documents and a set of
non-exclusive categories. Let $p_{ij}$ be the probability that a document $d_i$
belongs to the category $c_j$. For the purpose of retrieval the boundaries
of categories cannot remain indefinite. This restriction implies that a
cutoff point for the probability should be established. A document $d_i$
then is considered, for a particular retrieval query, to be within a
category $c_j$ if the probability of its being in the category is larger
than the cutoff point value, $\sigma$. If all documents belonging to the
intersection of two categories $c_j$ and $c_k$ are to be retrieved, then, assuming
that the probabilities of documents belonging to categories are
independent, the cutoff point for $c_j \cdot c_k$ will be $\sigma^2$. Thus it may be expected
that some superfluous or extraneous documents will be retrieved.

From the point of view of retrieving the union of classes, there
is a symmetrically opposite situation; some documents that are relevant
will not be retrieved. If the cutoff point for the classes of documents
defined by the descriptors a and b is again the probability value equal
to $\sigma$, the probability of a document belonging to the class defined by
the union $a \cup b$ will be $2\sigma - \sigma^2$. This quantity, however, is always
greater than $\sigma$, since $\sigma < 1$; the proof is:

    $$ (2\sigma - \sigma^2) - \sigma = \sigma - \sigma^2 = \sigma(1 - \sigma) \tag{4-29} $$

which must always be positive.
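The shift of the effective cutoff can be checked with a few lines of arithmetic. This is a minimal sketch under the independence assumption stated above; `effective_cutoffs` is an illustrative name.

```python
def effective_cutoffs(sigma):
    """Membership probability of a document sitting exactly at the cutoff sigma
    for both descriptors, evaluated for the intersection and for the union."""
    intersection = sigma * sigma             # both memberships hold
    union = 2 * sigma - sigma * sigma        # at least one membership holds
    return intersection, union

for sigma in (0.25, 0.5, 0.75):
    low, high = effective_cutoffs(sigma)
    print(f"sigma={sigma}: intersection={low}, union={high}")
```

For every cutoff strictly between 0 and 1 the intersection value falls below the cutoff and the union value rises above it, which is the asymmetry described in the text.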

This analysis proves that the standard of admissibility of a
document to a class of retrieved documents cannot be maintained if the
Boolean retrieval functions are used. The cutoff probability will be
lowered in the case of a retrieval criterion of logical intersection and
will be raised in the case of a union.

The question remains as to how the retrieval process is to be
organized in order to preserve the same cutoff point for the results of
retrievals upon any request. In continuing the analysis, it is
necessary to formulate an explicit goal. Boolean retrieval has been proved
inadequate in the sense of not preserving the criterion of admissibility.
The problem, therefore, is to find a procedure that will permit the
retrieval of classes of documents satisfying this criterion. The
simplest system would calculate the probability of a document belonging to
the category specified in the request; then the document would be accepted
or rejected depending upon the value of the calculated probability. However,
a system of this nature may be uneconomical for the following reasons:

(a)  The  system  would  be  forced  to  scan  documents  with  a  low  proba¬ 
bility  of  belonging  to  a  given  descriptor.  Such  a  procedure  is 
uneconomical  because  the  system  must  scan  through  a  substantial 
portion  of  the  document  collection  for  every  request. 

(b)  The  necessity  of  performing  a  computation  for  each  document 
scanned  to  determine  its  probability  of  belonging  to  the  class 
represented  by  request  may  increase  the  retrieval  time  beyond 
tolerable  limits. 

For reasons of economy, therefore, it may be useful to introduce
an a priori fixed categorization that would relieve the system of the
necessity of scanning the documents with low probability values and
performing the attendant computations.

This analysis has already shown that the formation of categories
with a fixed probability cutoff point for a given descriptor implies that
this criterion will not be preserved under general retrieval procedures,
which will generally specify more complex logical functions. If some
concessions to economy are granted, the result will be a retrieval process
that will omit some desirable documents and yield some undesirable ones.
Within the framework of such a situation there may still be an optimum
solution.


The basic premise is that the boundaries of descriptor
extensions will be fixed a priori. At the same time these boundaries, the
cutoff points, will be fixed in such a way as to maximize the value of
the average retrieval process to the user. This premise does not
necessarily include the restriction that a single cutoff point should be
established for any descriptor extension; instead, the number of cutoff
points should be established a priori, whatever that number might be.

The  problem  then  resolves  itself  to: 

(a)  Finding  rational  criteria  for  establishing  what  the  user's 
value  of  retrieval  procedures  is. 

(b)  Constructing  a  method  for  deriving  the  values  of  cutoff  points 
that  will  optimize  these  criteria. 

The  rest  of  this  section  presents  an  analysis  of  these  problems. 

4.2.2.3 The Problem of Establishing Criteria for Determining
User's Value of an Average Retrieval Procedure - With respect to any
retrieval request the entire collection of documents may be divided into
four subgroups:

(a)  The  retrieved  documents  that  are  relevant. 

(b)  The  retrieved  documents  that  are  not  relevant. 

(c)  The  unretrieved  documents  that  are  relevant. 
(d) The unretrieved documents that are not relevant.

Since  it  was  assumed  that  the  descriptors  are  assigned  to  documents  on 
a  probabilistic  basis,  all  four  subgroups  will  be  generally  represented 
in  any  retrieval  process. 

Regardless of any special assumptions, it is clearly permissible
to assert that as the number of documents in categories (a) and (d)
increases and as the number of documents in categories (b) and
(c) decreases, the value of the retrieved collection to the user will
increase. Thus,

    $$ V = f_1\{I\} - f_2\{II\} - f_3\{III\} + f_4\{IV\} + K \tag{4-30} $$

where V is defined as the user value of the retrieved collection; $f_1$,
$f_2$, $f_3$, and $f_4$ are unspecified, monotonically increasing functions; and
{I}, {II}, {III}, and {IV} are the numbers of documents in the subclasses
(a), (b), (c), and (d), respectively. K is defined as a constant that
determines the minimal value for the user below which the retrieval is
not justified under any circumstances.

For simplicity, replace $f_1$, $f_2$, $f_3$, and $f_4$ by the constants
$\alpha$, $\beta$, $\gamma$, and $\delta$, and set K = 0. The results of the discussion are not
essentially modified by this simplification. Equation 4-30 then becomes:

    $$ V = \alpha\{I\} - \beta\{II\} - \gamma\{III\} + \delta\{IV\} \tag{4-31} $$

Since K = 0, the retrieval process should proceed as long as the increment
of V, dV, is positive. That is, the process may select a group of
documents with common probability characteristics (in relation to the request
profile) and then investigate the change of V by including some additional
documents with lower probability characteristics. The question as to
which documents will be retrieved is the problem of fixing the most
advantageous values for the set $\{\sigma_i\}$ of cutoff points for the descriptor classes.
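The linear value function of Equation 4-31 and the sign test on its increment can be sketched as follows. The single-group increment formula is an illustrative simplification that is not taken from the report: it assumes n additional documents, each relevant with the same probability p, are moved from the unretrieved to the retrieved side.

```python
def user_value(n1, n2, n3, n4, alpha, beta, gamma, delta):
    """V = alpha{I} - beta{II} - gamma{III} + delta{IV} (Equation 4-31)."""
    return alpha * n1 - beta * n2 - gamma * n3 + delta * n4

def value_increment(n, p, alpha, beta, gamma, delta):
    """Expected change in V when n more documents are retrieved.

    Assumption: n*p relevant documents move from class III to class I and
    n*(1-p) irrelevant documents move from class IV to class II.
    """
    return n * (p * (alpha + gamma) - (1 - p) * (beta + delta))

# Hypothetical class sizes before and after retrieving 10 documents with p = 0.8.
before = user_value(30, 10, 13, 57, 1, 1, 1, 1)
after = user_value(38, 12, 5, 55, 1, 1, 1, 1)
print(after - before, value_increment(10, 0.8, 1, 1, 1, 1))
```

Under this simplification, retrieval of a further group is worthwhile only while `value_increment` is positive, that is, while p exceeds (beta + delta) / (alpha + beta + gamma + delta).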

The appropriateness of replacing the functions $f_1$, $f_2$, $f_3$, and
$f_4$ by the constants $\alpha$, $\beta$, $\gamma$, and $\delta$ rests upon the understanding of what
factors could be responsible for non-linearity of the function V.
Essentially there are two reasons why the function V should be non-linear.
The first pertains to the economics of using documents; the other, to
the problem of redundancy. In general, the efficiency with which the
retrieved collection is used depends upon its size, even if the value of
the individual documents in the collection is not prejudged. Nevertheless,
since retrieval systems can be used in various ways, it is safe to assume
that for many uses the relative emphasis placed upon the classes of
retrieved and unretrieved documents remains unchanged. To the extent
that this assumption is true, the fact that the function V depends upon
class {IV}, the class of correctly unretrieved documents, helps to remedy
the situation.

The second objection is more serious. Among the retrieved
documents there may be a high degree of redundancy; in extreme cases the
same amount of information may be covered more efficiently by a smaller
number of documents. It is difficult, however, to decide whether or not
redundancy is a linear function of the size of the retrieved collection.
To answer this question adequately, it would be necessary to formalize
the concept of redundancy among documents and then perhaps to formulate
theoretical prescriptions for procedures that would permit the system to
retrieve the most efficient covering of the topic specified in the request.
(This problem is a difficult task in itself and merits separate
investigation.) Pending at least a crude formulation of the theory of redundancy,
this discussion will be confined to the simplest assumption of linearity.
Therefore, given the function V in the form of Equation 4-31, the first
task is to find the set of $\sigma_i$ values, the cutoff points, that would maximize
the user's value for an average retrieval process.

4.2.2.4 Method for Determining Cutoff Point Values - The
following symbols will be used in this exposition:

    $N$        = total number of documents in the collection.

    $N_{iT}$   = total number of documents belonging to the descriptor i.

    $n_i(p_k)$ = the number of documents containing the descriptor i
                 within the probability interval centering around $p_k$.

    $N_i(p)$   = the distribution function for the descriptor i defined
                 on the probability values as a random variable.

                 $$ N_i(p) = \int_0^p n_i(p') \, dp' $$

    $\bar{p}_i$ = the average value of the probability for the descriptor
                 i.

                 $$ \bar{p}_i = \frac{1}{N_{iT}} \int_0^1 n_i(p) \, p \, dp $$

    $\bar{p}_i(\sigma)$ = the average probability value for the descriptor i
                 in the probability interval between 0 and $\sigma$.*

                 $$ \bar{p}_i(\sigma) = \frac{1}{N_{iT}} \int_0^{\sigma} n_i(p) \, p \, dp $$

    $f_i$      = the frequency with which the descriptor i is used.

    $s$        = the total number of descriptors.

    $\sigma_i$ = the value of terminal probability defining the boundary
                 of the class of documents belonging to descriptor i.

    * The normalization factor is $N_{iT}$, not $N_i(\sigma)$.

Then, by definition:

    $$ \frac{dN_i(p)}{dp} = n_i(p) $$

    $$ \frac{d\bar{p}_i(\sigma)}{d\sigma} = \frac{\sigma \, n_i(\sigma)}{N_{iT}} \tag{4-32} $$


In this discussion the descriptors are assumed to be independent. To
facilitate computation, the number of documents in each class is assumed
to be large enough and the subdivision into the probability brackets fine
enough to permit integration techniques to replace summation.


The procedure for calculating the set of $\sigma_i$'s that will maximize
V on pairs of descriptors is:

(a) Calculate the numbers of documents for the four subclasses of
documents that enter V for an unspecified $\sigma_i$.

(b) Obtain a general expression for V.

(c) Obtain an expression for the expectation value for all V's.

(d) Differentiate the expression obtained under (c), and set the
coefficient of differentials equal to zero in order to obtain
a set of conditions for the maximum.

(e) Solve the equations to obtain the values of the $\sigma_i$'s.
These steps can now be developed and expressed mathematically.
The expression for the number of documents containing the descriptor i
within the probability interval centering around $p_{k_1}$ and the descriptor
j within a probability interval centering around $p_{k_2}$ is:

    $$ \frac{n_i(p_{k_1}) \cdot n_j(p_{k_2})}{N} \tag{4-33} $$

The probability of the document being within the two-dimensional
probability interval centering around the values $p_{k_1}$ and $p_{k_2}$ is the product
of the probabilities:

    $$ p(k_1, k_2) = p_{k_1} \cdot p_{k_2} \tag{4-34} $$

By using these equations, expressions can be calculated for the four
classes of documents involved in the function V:

(a) Class I - The class of all correctly retrieved documents:

    $$ \{I\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \cdot n_j(p_j)}{N} \, p_i p_j \, dp_i \, dp_j \tag{4-35} $$

(b) Class II - The class of all the incorrectly retrieved documents:

    $$ \{II\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \cdot n_j(p_j)}{N} \, (1 - p_i p_j) \, dp_i \, dp_j \tag{4-36} $$

(c) Class III - The class of incorrectly unretrieved documents:

    $$ \{III\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \cdot n_j(p_j)}{N} \, p_i p_j \, dp_i \, dp_j \tag{4-37} $$

(d) Class IV - The class of all correctly unretrieved documents:

    $$ \{IV\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \cdot n_j(p_j)}{N} \, (1 - p_i p_j) \, dp_i \, dp_j \tag{4-38} $$


The retrieval process proceeds until the predetermined cutoff
point for descriptor i and for descriptor j has been reached. To
retrieve beyond this point will be detrimental, since on the average the
increment in V caused by additional retrieval will be negative.


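The four double integrals in Equations 4-35 through 4-38 can
now be evaluated. For Equation 4-35:

    $$ \{I\} = \frac{1}{N} \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} n_i(p_i) \, n_j(p_j) \, p_i p_j \, dp_i \, dp_j $$

    $$ = \frac{1}{N} \int_{\sigma_i}^{1} n_i(p_i) \, p_i \, dp_i \int_{\sigma_j}^{1} n_j(p_j) \, p_j \, dp_j $$

    $$ = \frac{N_{iT} N_{jT}}{N} [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] \tag{4-39} $$

By using the definitions for $N_i(p)$ and $N_{iT}$, Equations 4-36 through 4-38
become:

    $$ \{II\} = \int_{\sigma_i}^{1} \int_{\sigma_j}^{1} \frac{n_i(p_i) \cdot n_j(p_j)}{N} (1 - p_i p_j) \, dp_i \, dp_j $$

    $$ = \frac{1}{N} [N_{iT} - N_i(\sigma_i)] [N_{jT} - N_j(\sigma_j)]
         - \frac{N_{iT} N_{jT}}{N} [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] \tag{4-40} $$

    $$ \{III\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \cdot n_j(p_j)}{N} \, p_i p_j \, dp_i \, dp_j
         = \frac{N_{iT} N_{jT}}{N} \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j) \tag{4-41} $$

    $$ \{IV\} = \int_{0}^{\sigma_i} \int_{0}^{\sigma_j} \frac{n_i(p_i) \cdot n_j(p_j)}{N} (1 - p_i p_j) \, dp_i \, dp_j $$

    $$ = \frac{1}{N} [N_i(\sigma_i) N_j(\sigma_j) - N_{iT} N_{jT} \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j)] \tag{4-42} $$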
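The factorization used to evaluate the Class I integral can be checked on a discrete example. The sketch below replaces the integrals by sums over hypothetical probability brackets; the histogram values and the function name are illustrative assumptions.

```python
def class_one(n_i, n_j, probs, sigma_i, sigma_j, total):
    """Expected number of correctly retrieved documents for a descriptor pair.

    n_i[k], n_j[k] -- documents of descriptors i and j in bracket probs[k]
    Returns the direct double sum and the factored (product) form.
    """
    direct = sum(
        n_i[a] * n_j[b] / total * probs[a] * probs[b]
        for a in range(len(probs)) if probs[a] >= sigma_i
        for b in range(len(probs)) if probs[b] >= sigma_j
    )
    part_i = sum(n * p for n, p in zip(n_i, probs) if p >= sigma_i)
    part_j = sum(n * p for n, p in zip(n_j, probs) if p >= sigma_j)
    factored = part_i * part_j / total
    return direct, factored

probs = [0.1, 0.3, 0.5, 0.7, 0.9]     # bracket centers (assumed)
n_i = [40, 30, 20, 15, 10]            # hypothetical histogram for descriptor i
n_j = [25, 25, 20, 20, 10]            # hypothetical histogram for descriptor j

direct, factored = class_one(n_i, n_j, probs, 0.5, 0.7, total=1000)
print(direct, factored)
```

Here `part_i` plays the role of N_iT times the bracketed difference of average probabilities in Equation 4-39, so the product form is the discrete counterpart of that result.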


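By substituting Equations 4-39, 4-40, 4-41, and 4-42 into
Equation 4-31, the function V becomes:

    $$ V_{ij} = \frac{\alpha}{N} N_{iT} N_{jT} [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] $$

    $$ - \frac{\beta}{N} \{ [N_{iT} - N_i(\sigma_i)] [N_{jT} - N_j(\sigma_j)]
         - N_{jT} N_{iT} [\bar{p}_i - \bar{p}_i(\sigma_i)] [\bar{p}_j - \bar{p}_j(\sigma_j)] \} $$

    $$ - \frac{\gamma}{N} N_{iT} N_{jT} \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j) $$

    $$ + \frac{\delta}{N} [N_i(\sigma_i) N_j(\sigma_j) - N_{iT} N_{jT} \, \bar{p}_i(\sigma_i) \, \bar{p}_j(\sigma_j)] \tag{4-43} $$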
Now it is possible to find the values of $\sigma_i$ and $\sigma_j$ that will
maximize a specific $V_{ij}$. In general, however, the values $\sigma_i'$ and $\sigma_i''$
obtained by solving for maxima in expressions $V_{ij}$ and, say, $V_{ik}$ will
be different. This observation implies that we are looking for a set
of values that will maximize an average $V_{ij}$.

The average value of $V_{ij}$ is, of course, its expected value:
and this function will have to be maximized. The differential of
Equation 4-44 is:

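    $$ dE = \frac{1}{2} \sum_{i=1}^{s} \sum_{j=1}^{s} f_i f_j
            \left[ \frac{\partial V_{ij}}{\partial \sigma_i} d\sigma_i
                 + \frac{\partial V_{ij}}{\partial \sigma_j} d\sigma_j \right] \quad (i \neq j) $$

or

    $$ dE = \sum_i f_i \left[ \sum_j f_j \frac{\partial V_{ij}}{\partial \sigma_i} \right] d\sigma_i \tag{4-45} $$

which implies the following condition for a maximum:

    $$ \sum_j f_j \frac{\partial V_{ij}}{\partial \sigma_i} = 0 \quad (i = 1, 2, \ldots, s) \tag{4-46} $$

The partial derivatives $\partial V_{ij} / \partial \sigma_i$ in Equation 4-46 can be
computed by using Equations 4-39, 4-40, 4-41, 4-42, and 4-43.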
Performing the summations in Equation 4-46 on Equations 4-47 through 4-50
results in:

Therefore, the following equations can be solved for the $\sigma_i$'s:

for i = 1,2,...,s; where i ≠ j.                                    (4-52)

In order to get some insight into the nature of the solution,
set $\gamma = \delta = 0$; i.e., the function V depends only upon classes {I} and
{II}. In this case, Equation 4-55 is simplified to:

for i = 1,2,...,s; where i ≠ j. After rearranging and dividing by the
common factor, $n_i(\sigma_i)/N$, Equation 4-56 becomes:

for i = 1,2,...,s; where j ≠ i.
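As a worked check, the maximum condition for the case $\gamma = \delta = 0$ can be rederived from Equations 4-39, 4-40, and 4-46. The block below is a sketch under those definitions and should not be read as the report's own displayed equations; the shorthands $A_i$ and $R_i$ are introduced here only for brevity.

```latex
\begin{align*}
A_i &= N_{iT}\,[\bar{p}_i - \bar{p}_i(\sigma_i)], \qquad
R_i = N_{iT} - N_i(\sigma_i) \\
V_{ij} &= \frac{\alpha + \beta}{N}\, A_i A_j - \frac{\beta}{N}\, R_i R_j
  \qquad (\gamma = \delta = 0) \\
\frac{\partial A_i}{\partial \sigma_i} &= -\sigma_i\, n_i(\sigma_i), \qquad
\frac{\partial R_i}{\partial \sigma_i} = -n_i(\sigma_i) \\
\sum_{j \neq i} f_j \frac{\partial V_{ij}}{\partial \sigma_i}
  &= \frac{n_i(\sigma_i)}{N}
     \Big[ \beta \sum_{j \neq i} f_j R_j
         - (\alpha + \beta)\, \sigma_i \sum_{j \neq i} f_j A_j \Big] = 0 \\
\sigma_i &= \frac{\beta \sum_{j \neq i} f_j R_j}
                 {(\alpha + \beta) \sum_{j \neq i} f_j A_j}
\end{align*}
```

The common factor $n_i(\sigma_i)/N$ divides out, and the only dependence on i that remains on the right hand side is the omission of the term j = i from each sum.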


The solutions of Equation 4-57 for different i's, say $i'$ and $i''$,
are almost identical. The solutions differ only by the absence in the
summation on the right hand side of the term corresponding to i. To
demonstrate this point more clearly, redefine the summations in Equation
4-57 as:

Then, by inserting Equation 4-58 in Equation 4-57 and adding back the
term for j = i:

    $$ \sigma_i = \frac{-\beta \, g(N) + \beta f_i [N_i(\sigma_i) - N_{iT}]}
                       {(\alpha + \beta) \, g(p) - (\alpha + \beta) f_i N_{iT} [\bar{p}_i - \bar{p}_i(\sigma_i)]} \tag{4-59} $$

Since the two g terms represent summations over all values of j, they
are identical for all $\sigma_i$'s. Now, if s is large, the terms $f_i[N_i(\sigma_i) - N_{iT}]$
and $f_i N_{iT}[\bar{p}_i - \bar{p}_i(\sigma_i)]$ are small compared with g(N) and g(p), respectively.
The reason is that with the large number of descriptors, thus a large s,
the weights $f_i$, which represent the frequency of usage of descriptors,
are all small fractions of the order 1/s. Therefore, the values of $\sigma_i$'s
are approximately equal. If $\sigma_i$ is multiplied by the denominator of the
expression on the right hand side of Equation 4-59 and summed over all
i, then:

From the definitions of the two g functions, Equation 4-60 becomes:

    $$ s \sigma (\alpha + \beta) g(p) - (\alpha + \beta) \sigma g(p) = -s \beta g(N) + \beta g(N) $$

which is equivalent to Equation 4-57


The minus sign in Equation 4-61 occurs because g(N) is inherently
negative. Each term in the summation for g(N) is negative. Since $N(\sigma)$
is a monotonically increasing function of $\sigma$, it is now possible to
interpret the meaning of the value of $\sigma$ established in Equation 4-61.

It  is  apparent  that  -g(N)  represents  the  average  or  expected 
number  of  retrieved  documents.  On  the  other  hand,  each  term  of  g(p) 
represents  a  product  of  the  average  probability  of  retrieved  documents 
times  the  size  of  the  descriptor  group  normalized  by  the  frequency  of 
usage of this descriptor. Thus the g(p) function expresses the average
number of retrieved documents properly belonging to the average
descriptor, weighted by its frequency of occurrence. It is thus seen that the
optimum $\sigma$, expressed by Equation 4-61, is a function of the constants
$\alpha$ and $\beta$, which express the relative importance attached to the correctly
and incorrectly retrieved documents; the optimum $\sigma$ is also a function
of two averages, namely g(N) and g(p).

It is evident that the higher the value of $\beta$ (i.e., the importance
attached to incorrectly retrieved documents), the higher will be the value of
$\sigma$. And as $\sigma$ increases, fewer documents will be retrieved. On the other
hand, the higher the value of $\alpha$ (i.e., the importance attached to the
correctly retrieved documents), the lower will be the value of $\sigma$. For lower
values of $\sigma$ more documents will be retrieved. The function -g(N) decreases
as $\sigma$ increases, and so does g(p). When $\sigma = 0$:
To evaluate the expression for $\sigma = 1$, L'Hôpital's rule must be used
because of the indeterminacy of 0/0:

Thus, at $\sigma = 1$:

Now, if the largest $N_j$, denoted $N_{j\max}$, is factored out of the
denominator:

Since $N_j/N_{j\max} \le 1$ for all j, it is clear that:

Thus, for $\sigma = 1$, Equation 4-61 can only be satisfied if:


But  this  result  is  an  impossibility.  This  fact  demonstrates  that  it  is 
never  advantageous  to  admit  only  the  documents  that  belong  to  a  class 
specified  by  a  descriptor  with  certainty. 


The  formulas  derived  in  this  analysis  pertain  to  joint  retrieval 
on  two  descriptors.  Similar  derivations,  although  somewhat  more  complex, 
can  be  carried  out  for  the  arbitrary  joint  retrievals  on  k  descriptors. 
The  task  of  deriving  these  formulas  will  be  continued  in  subsequent 
research  activity. 

Beyond joint retrievals there looms a question of retrievals
specified  in  a  request  by  an  arbitrary  Boolean  function.  Such  problems 
may  be  handled  by  breaking  up  the  arbitrary  Boolean  function  into  a 
canonical  form  of  disjunction  of  conjunctions.  All  that  is  now  neces¬ 
sary  are  formulas  for  calculating  cutoff  probabilities  for  disjunctions. 
This  problem  will  also  be  handled  as  a  part  of  future  activity. 
4.2.2.5 Conclusions - It is now possible to outline the general
features of a non-Boolean retrieval system. To each descriptor there
will correspond a collection of classes of documents instead of a unique
class of documents. Each class will be determined by a different cutoff
point $\sigma$. For each document, there will be two types of cutoff points,
disjunctive and conjunctive. Within each of these categories an individual
$\sigma$ will have its value determined in accordance with the type of joint
retrieval it is scheduled to participate in. Thus there will be one cutoff
point for the conjunction of two descriptors, another one for the
conjunction of three, etc. The same principle holds for the cutoff points
for  disjunctive  retrievals.  Any  incoming  request  will  be  transformed 
into  convenient  canonical  form;  for  example,  a  disjunction  of  conjunc¬ 
tions.  The  appropriate  cutoff  points  will  then  be  selected  and  retrieval 
effected. 


In  order  to  calculate  the  cutoff  points,  certain  parameters  are 
required.  These  parameters  can  be  obtained  by  requiring  the  system  to 
perform  bookkeeping  operations  which  will  supply  the  required  data. 
Essentially,  the  kind  of  statistical  data  necessary  for  the  calculation 
of  the  cutoff  points  is: 

(a) $n_i(p)$ = the "density" of documents pertaining to a given descriptor
i for a given probability interval.

(b) $\bar{p}_i(\sigma)$ = the average probability value of a document belonging
to the descriptor i as a function of the cutoff point $\sigma$.

(c) $N_i(\sigma)$ = the total number of documents belonging to the descriptor
i as a function of $\sigma$.

The  most  fundamental  of  the  three  types  of  data  is  (a),  since  (b)  and  (c) 
can  be  calculated  from  it. 
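
As an illustration of how (b) and (c) might be obtained from (a), the sketch below tabulates a density as (probability, count) pairs and derives the number and the mean probability of the documents at or above a cutoff. The function name `stats_above_cutoff`, the discretized histogram, and the convention of counting documents at or above the cutoff are assumptions made for this example; the report's own sign convention for N may differ.

```python
def stats_above_cutoff(density, cutoff):
    """Derive (count, mean probability) for documents with p >= cutoff.

    density: list of (p, n) pairs, a discretized stand-in for n_i(p),
             meaning that n documents carry membership probability p.
    """
    kept = [(p, n) for p, n in density if p >= cutoff]
    count = sum(n for _, n in kept)
    if count == 0:
        return 0, None  # no document reaches the cutoff
    mean_p = sum(p * n for p, n in kept) / count
    return count, mean_p


# Hypothetical density for a single descriptor.
density = [(0.1, 10), (0.5, 20), (0.9, 10)]
count, mean_p = stats_above_cutoff(density, 0.5)
```

Keeping only the density as bookkeeping data means that the other two quantities can be recomputed for any cutoff without being stored separately.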
4.2.2.6 Priorities for the Future - At this time the most
important extensions of this task appear to be:

(a) The derivation of values of $\sigma$ for the joint retrieval of products
of an arbitrary number of descriptors.

(b) The derivation of values of $\sigma$ for the joint retrieval of logical
sums of an arbitrary number of descriptors.

The activities next in order of precedence will involve:

(a)  The  evaluation  of  errors  arising  out  of  approximations  used  in 
the  derivations. 

(b)  The  consideration  of  modifications  arising  out  of  the  removal 
of  the  assumption  of  the  independence  of  descriptors. 

(c)  Considerations  of  an  economic  nature  pertaining  to  the  costs 
involved  in  the  implementation  of  the  non-Boolean  retrieval 
systems  for  different  types  of  applications. 

4.3 QUERY CAPABILITIES

Work performed in this area deals with an approach to the generation
of extracts or abstracts and with the problem of redundancy. The relaxation
of limitations on description is dealt with indirectly in the preceding
material on non-Boolean retrieval.

4.3.1 An Approach to a Criterion for Automatically Generated Extracts

Automatic extracting was originally described by Luhn [1] some time ago.
While he refers to the end products of his process as abstracts, they are
more accurately characterised as extracts of what are hopefully the more
central, critical, or descriptive sentences in a document. Luhn's technique
is purely statistical. Sentences are selected for extracting on
the basis of two related facts about their word content:

(a)  The  relative  frequency  of  the  words  in  the  sentence,  except  for 
common  words. 
(b) The distance between high frequency words in the sentence, based
upon the number of intervening non-clue words.
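
The two criteria above can be illustrated with a rough sketch of frequency-and-distance sentence scoring. Everything specific in it is an assumption made for illustration: the function name `luhn_scores`, the `min_freq` and `max_gap` thresholds, and the scoring rule (the square of the number of high-frequency words in a cluster divided by the span of the cluster). It should not be read as Luhn's exact procedure.

```python
from collections import Counter


def luhn_scores(sentences, common_words, min_freq=2, max_gap=4):
    """Score tokenized sentences by clusters of high-frequency words.

    sentences: list of sentences, each a list of lowercase word tokens.
    common_words: words excluded from the frequency count.
    """
    counts = Counter(w for s in sentences for w in s if w not in common_words)
    significant = {w for w, c in counts.items() if c >= min_freq}

    scores = []
    for sentence in sentences:
        positions = [i for i, w in enumerate(sentence) if w in significant]
        best = 0.0
        if positions:
            start = prev = positions[0]
            hits = 1
            for pos in positions[1:]:
                if pos - prev - 1 <= max_gap:
                    hits += 1  # close enough: same cluster
                else:
                    # Close the current cluster and start a new one.
                    best = max(best, hits * hits / (prev - start + 1))
                    start, hits = pos, 1
                prev = pos
            best = max(best, hits * hits / (prev - start + 1))
        scores.append(best)
    return scores


# Hypothetical two-sentence document.
sentences = [
    ["retrieval", "of", "documents", "by", "retrieval"],
    ["the", "cat", "sat"],
]
scores = luhn_scores(sentences, common_words={"of", "by", "the"})
```

Sentences with the highest scores would be the candidates for the extract; how many to keep is a separate ground rule.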

While  Luhn  presents  a  rather  vague  theoretical  rationale  for  the 
validity  of  such  an  approach,  there  is  no  attempt  to  justify  it  in  detail, 
except  on  the  grounds  that  it  can  produce  useful  extracts.  No  attempt  is 
made  to  show  whether  extracts  generated  by  any  other  technique  are  more 
or less useful. Recently Quiliano, et al. [2], at Arthur D. Little have
proposed  a  technique  for  incorporating  syntactic  information  into  the 
distance  measure  in  order  to  make  the  technique  more  useful. 

There  seem  to  be  two  things  lacking  in  this  approach  to  automatic 
abstracting or extracting:

(a)  A  lack  of  any  criterion  or  perhaps  of  multiple  criteria,  depend¬ 
ing  on  the  context  in  which  the  extract  is  to  be  used,  for  deter¬ 
mining  the  adequacy  of  any  given  extract  or  extracting  scheme. 

(b)  A  lack  of  understanding  of  the  fundamental  processes  involved 
in  human  abstracting,  extracting,  condensation,  or  perception 
of statement saliency in a longer argument or presentation.

It would seem that a combination of the approach of Newell and Simon
[3] to the simulation of cognitive processes (theorem proving and problem
solving more generally) and the approach of Maron [4] to the automatic
classification  of  documents  might  be  appropriate.  While  each  of  these 
studies  is  well  known,  it  might  be  appropriate  to  indicate  briefly  which 
aspects  of  their  methodology  are  relevant  to  alleviating  the  two  short¬ 
comings  in  present  automatic  extracting  systems. 

Newell, et al., in order to simulate cognitive functioning, first
used a method of observation and introspection to gain insight into the
method by which humans proved logic theorems. In the context of information
retrieval the major emphasis is on useful extraction rather than on the
simulation of human extraction. It may nevertheless pay to observe human
extracting behavior in order to develop more useful algorithms for
obtaining automatic extracts.

The work of Maron and Kuhns has already been described in previous
reports. It involved the use of human classification of a set of items
as a criterion for automatic classification. The automatic classification,
however, was not based on the unknown techniques of the human
classifiers. The automatic algorithm was based rather upon purely
statistical features of some of the classified documents. Human
classification was also available, however, to provide the criterion for checking
the adequacy of the automatic algorithm once it was derived.

In the case of automatic extracting both of these techniques might
prove  useful.  That  is,  the  use  of  observation  and  introspection  would 
help  alleviate  the  difficulty  caused  by  the  lack  of  understanding  of 
human  functions  and  allow  for  the  development  of  more  rational  extract¬ 
ing  algorithms.  Perhaps  these  techniques  could  be  ultimately  extended 
to  abstracting  per  se.  The  records  of  humanly  generated  extracts  could 
be  used  as  a  criterion  for  evaluating  the  adequacy  of  various  automatic 
algorithms.  The  latter  would  alleviate  the  difficulty  caused  by  the 
non-existence  of  suitable  criteria. 

The  paradigm  for  such  research  and  development  would  be  as  follows: 
(a) A series of documents, either larger texts or shorter articles
for  research  convenience,  would  be  selected  for  extracting. 

(b)  Ground  rules  for  desired  extracts  would  be  developed;  e.g.: 

(1)  How  long  should  each  extract  be?  Should  it  be  some  fixed 
proportion  of  the  total  document? 

(2)  What  sentential  units  should  be  extracted?  Whole  sentences 
only?  Parts  of  sentences?  Parts  that  can  be  recombined 
to  form  larger  sentences? 

(3)  What  is  the  focal  purpose  of  the  extract?  To  extract  as 
much  factual  information  as  possible  within  the  limits 
imposed  by  the  length  of  an  extract?  To  characterize  the 
document  as  well  as  possible  in  order  that  the  reader 
might  know  what  information  it  contains?  Both  of  these? 

(4) What information or techniques may be used in generating
the  extract?  Anything  that  occurs  to  the  user  based  upon 
his  total  knowledge?  Anything  based  on  the  explicit  and 
implicit  content  of  the  document?  Only  explicit  content? 
Only  rigorously  formulated  rules? 

(c)  The  documents  would  then  be  subjected  to  human  extracting  using 
instructions based upon the ground rules.

(d)  A  portion  of  the  humanly  extracted  documents  would  be  carefully 
subjected  to  introspective  report  and  an  analysis  of  the  implicit 
rules followed in extracting.

(e)  Based  on  this  analysis,  one  or  several  automatic  algorithms 
would  be  developed  for  achieving  essentially  the  same  extracts 
from  readily  treated  information  in  the  documents.  For  the  sake 
of  generality,  an  attempt  would  also  be  made  to  incorporate  those 
rules  manifest  in  introspective  protocols  that  could  be  handled 
by  computers. 

(f)  Measures  of  correspondence  between  humanly  and  automatically 
generated  extracts  would  then  be  developed. 

(g)  Finally,  the  automated  techniques  would  be  applied  to  the  remain¬ 
ing  documents  in  the  sample  and  the  extracts  generated  would  be 
validated  against  the  criterion  of  the  human  extracts  already 
available. 
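
Step (f) leaves the measures of correspondence open. One simple possibility, sketched below, treats each extract as a set of sentence positions and compares the sets. The function name `correspondence` and the choice of precision, recall, and Jaccard overlap are assumptions made for this example, not measures specified in the report.

```python
def correspondence(human, automatic):
    """Compare two extracts given as collections of sentence indices."""
    human, automatic = set(human), set(automatic)
    shared = human & automatic
    union = human | automatic
    return {
        # Share of automatically chosen sentences the human also chose.
        "precision": len(shared) / len(automatic) if automatic else 0.0,
        # Share of humanly chosen sentences the algorithm recovered.
        "recall": len(shared) / len(human) if human else 0.0,
        # Overlap relative to everything either extract selected.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical extracts of one document.
measures = correspondence(human={0, 2, 5}, automatic={2, 5, 7, 9})
```

Set-based measures of this kind presuppose ground rule (2) was answered with whole sentences; partial-sentence extracts would need a finer unit of comparison.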


While  this  approach  depends  upon  research  and  development  strategies 
already  developed  by  others,  its  application  to  the  information  retrieval 
problem is unique. It would probably be unwise to embark on a specific
program  of  this  kind  in  the  remaining  part  of  this  project,  but  further 
research  along  these  lines  seems  unwarranted. 

4.3.2 The Problem of Redundancy in Information Retrieval Systems

4.3.2.1 Introduction - Redundancy in the information retrieval
processes  occurs  whenever  the  retrieved  data  is  duplicated.  To  avoid 
redundancy  is  important,  not  only  for  the  rather  obvious  economic  reason, 
but  also  for  operational  and  logical  reasons.  Theoretical  considerations 
pertaining  to  the  nature  of  measures  for  removing  redundancy  will  be  best 
understood  within  the  context  of  a  more  detailed  discussion  of  the  unde¬ 
sirability  of  duplication  from  these  three  points  of  view. 

4.3.2.2 Economic Point of View - For some types of information
retrieval  systems  the  cost  of  retrieval  may  become  prohibitively  high, 
especially  if  all  the  data  pertaining  to  the  request  profile  is  retrieved. 

The  use  value  of  the  information  contained  in  the  retrieved 
data  may  be  drastically  reduced  by  the  existence  of  redundant  material. 
Effectively the user of the data is swamped by repetitious information.

4.3.2.3 Operational Point of View - Many information retrieval
systems  enter  into  larger  systems  as  component  units.  The  retrieved  data 
may  form  an  input  to  other  processes  such  as  control,  command  and  control, 
or  real-time  monitoring.  The  occurrence  of  redundant  material  may  not 
only  reduce  the  efficiency  of  the  functioning  of  the  system,  but  may  also 
affect  the  outcome  of  the  processes  to  which  the  retrieved  data  forms  an 
input. For example, imagine a system that is required to perform some
statistical tabulations on the incidence of car accidents among various
population groups. Furthermore, assume that the reports on automobile
accidents are incoming from diverse sources so that some accidents may
be reported more than once. Under such conditions it will be necessary,
in  order  to  obtain  valid  results,  to  introduce  some  filtering  stage 
that  will  prevent  or  eliminate  duplication.  Estimates  of  the  reliability 
of  the  results  obtained  will  in  general  depend  upon  the  effectiveness  of 
the  filtering  stage.  The  removal  of  data  redundancy  is  thus  vital  to  the 
satisfactory  performance  of  the  system  as  a  whole. 
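
A filtering stage of the kind described for the accident example might look like the following minimal sketch. The record fields (`date`, `location`, `plate`, `source`) and the rule that two reports describe the same accident when those three normalized fields agree are hypothetical choices made for illustration.

```python
def filter_duplicates(reports, key_fields=("date", "location", "plate")):
    """Keep the first report of each accident, dropping later repeats."""
    seen = set()
    unique = []
    for report in reports:
        # Normalize the identifying fields so trivial variations match.
        key = tuple(str(report[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(report)
    return unique


# Hypothetical reports arriving from two sources.
reports = [
    {"date": "1963-02-01", "location": "Main St", "plate": "ABC123", "source": "police"},
    {"date": "1963-02-01", "location": "main st ", "plate": "abc123", "source": "insurer"},
    {"date": "1963-02-03", "location": "Elm St", "plate": "XYZ789", "source": "police"},
]
unique = filter_duplicates(reports)
```

The reliability of any tabulation built on `unique` depends on how well the chosen key identifies an accident, which is the dependence on filter effectiveness noted above.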

4.3.2.4 Logical Point of View - In the process of decision
making  the  origin  of  the  data  may  be  as  relevant  to  the  decision  as  its 
content.  It  is  even  conceivable  that  the  existence  of  large  redundancy 
in  the  collected  data  may  be  one  of  the  important  factors  influencing 
the  nature  of  the  decision.  In  other  words,  the  decision  process  may 
be  dependent  on  the  manner  in  which  the  data  is  presented.  As  an  example, 
imagine  a  system  whose  task  it  is  to  solve  transportation-routing  problems. 
The  kind  of  solution  employed  may  well  depend  upon  the  complexity  of  a 
particular  problem.  If  the  particular  transportation  network  contains 
many nodes, the system will use one type of algorithm; if it contains
few  nodes,  then  another. 

Determining  the  nature  of  the  problem  may  depend  upon  sampling 
of  data;  thus  inaccuracies  will  arise  if  the  data  contains  a  large  amount 
of  redundancy.  Such  a  situation  is  particularly  prone  to  arise  if  the 
system schedules its own operations and batches many problems together.

Considering several ways in which the concept of redundancy is
implicated  in  the  information  retrieval  processes,  one  observes  a  basic 
dichotomy: 

(a)  Some  of  the  redundancy  problems  require  the  exact  scrutiny  of 
the  individual  data  items.  If  data  items  are  conventionally 
thought  of  as  documents,  then  a  sort  of  redundancy  map  could 
be  obtained  by  indicating  the  relationship  with  respect  to  the 
redundancy  of  each  document  to  every  other  document  in  the  col¬ 
lection.  The  simplest  kind  of  relation  between  documents  with 
respect  to  redundancy  is  that  of  inclusion;  that  is,  one  doc¬ 
ument  may  express  everything  that  another  document  expresses 
with  respect  to  a  given  topic.  Another  possible  relation, 
although  a  less  simple  one,  is  that  of  overlap.  A  document  may 
partially  express  the  content  of  another  document  with  respect 
to  a  given  topic  with  some  numerical  measure  of  the  partial 
covering. 

(b) It may be possible and/or desirable to handle the problem of
reducing  redundancy  on  an  aggregate  level.  The  distinguishing 
feature  of  this  approach  is  the  statistical  handling  of  infor¬ 
mation  contained  in  the  documents.  It  is  important  to  remember 
that,  since  the  primary  concern  is  redundancy,  the  basic  meas¬ 
ure  of  information  must  be  relative  rather  than  absolute.  That 
is,  such  a  measure  when  applied  to  a  document  should  be  able  to 
determine the expected number of documents rendered superfluous
by  the  document  in  question;  alternatively,  the  measure  should 
indicate  how  many  documents  render  a  given  document  superfluous. 

Usually  a  document  will  cover  a  number  of  topics.  In  general, 
it  must  be  expected  that  the  redundancy  measure  will  not  be 
evenly  distributed  among  all  the  topics  that  a  given  document 
deals  with.  Thus  with  respect  to  one  topic  a  document  may  be 
highly  unique,  whereas  with  respect  to  another,  highly  redundant. 
Whether  or  not  it  is  advisable  to  average  the  redundancy  meas¬ 
ure  over  all  topics  or  handle  them  separately  is  a  question  that 
may  be  decided  only  after  a  more  detailed  and  rigorous  study. 

It  is  also  possible  that  this  question  admits  no  unique  answer, 
since information retrieval systems are highly differentiated
with  respect  to  their  functional  characteristics. 

It  would  be  incorrect  to  assume  that  this  dichotomy  represents 
two  alternative  approaches.  It  is  quite  unrealistic  to  expect  that  an 
exhaustive  redundancy  map  comprising  the  detailed  breakdown  of  all  rela¬ 
tions  among  all  documents  individually  is  feasible.  Practically,  some 
sort  of  statistical  approach  is  necessary.  It  is  necessary,  however,  to 
demand  that  any  statistical  averages  employed  to  reduce  redundancy  capture 
the  true  statistical  properties  of  a  system  based  upon  the  requirements 
for  a  redundancy  map. 
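
The inclusion and overlap relations, and the aggregate count of documents rendered superfluous, can be made concrete with a small sketch. It assumes, purely for illustration, that the content of a document on one topic can be represented as a set of discrete statements; the names `covering`, `redundancy_map`, and `superfluous_counts` are hypothetical.

```python
def covering(a, b):
    """Fraction of b's statements on a topic that a also expresses."""
    if not b:
        return 1.0  # an empty document is trivially covered
    return len(a & b) / len(b)


def redundancy_map(docs):
    """Pairwise covering for every ordered pair of distinct documents."""
    return {(x, y): covering(docs[x], docs[y])
            for x in docs for y in docs if x != y}


def superfluous_counts(docs):
    """For each document, how many others it fully includes."""
    rmap = redundancy_map(docs)
    return {x: sum(1 for y in docs if y != x and rmap[(x, y)] == 1.0)
            for x in docs}


# Hypothetical statements on one topic, keyed by document name.
docs = {"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {3, 5}}
rmap = redundancy_map(docs)
counts = superfluous_counts(docs)
```

The map grows with the square of the number of documents, which illustrates why an exhaustive map is unrealistic for a large collection and why a statistical summary such as `counts` is the practical target.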

4.3.2.5 Conclusion - In conclusion, two tentative examples of
redundancy  measures  are  given: 




(a)  Each  document  is  characterised  by  a  set  of  numbers  expressing 
the percentage of documents containing more, or less, information
concerning a given topic.

(b)  Each  document  is  characterised  by  a  set  of  numbers  expressing 
the  additional  contribution  that  the  document  would  make  to  the 
given  topic,  assuming  the  average  number  of  documents  already 
retrieved. 

4.4 PROCESSING CAPABILITIES

Work in this area has been primarily concerned with organisation and
search procedures. No new progress has been made on the problem of
associative techniques.

4.4.1 Comparative Analysis of Some File Organisations

4.4.1.1 Introduction - This section contains a discussion of
a  number  of  file  organisations  that  may  be  suitable  for  the  retrieval  of 
documents  or  other  items  of  information.  The  exposition  largely  follows 
the  order  of  mathematical  development  rather  than  some  didactic  organisa¬ 
tion  for  easily  communicating  the  results.  This  method  of  exposition  is 
used  because  it  is  impossible  in  work  of  this  kind  to  know  at  the  begin¬ 
ning  where  fruitful  mathematical  analysis  will  lead. 

For  each  file  structure  considered,  expressions  are  derived  for 
the  average  or  expected  values  of  the  number  of  items  and  the  subject  or 
category  headings  examined  to  retrieve  a  single  item,  known  to  be  in  the 
file,  in  response  to  a  request.  The  file  organizations  are  then  compared 
and  evaluated  in  terms  of  these  expected  values  for  a  wide  range  of  file 
sizes.  To  aid  in  the  comparison,  variances  are  derived  and  plotted. 
Three different types of file organizations or structures will
be  compared.  They  are: 

(a)  Single-level  subject  headings. 

(b)  Hierarchical  trees  of  items. 

(c)  Hierarchical  trees  of  subject  headings. 

The first type consists of a single level of unrelated subject headings
or category names under which items are grouped or filed in a linear
sequence. An alphabetical card file is an example. The subject headings
in this example are simply the letters of the alphabet.

The  second  type  of  file  organisation  is  a  multi-level  tree  of 
items  that  are  connected  by  the  tree  structure.  This  connectivity  does 
not necessarily imply, however, a corresponding logical relation among
these  items. 

The tree of subject headings, on the other hand, is a multi-level
categorization of subject headings where each heading is divided
into two or more sub-headings down to the lowest level of detail. The
tree of subject headings is intended to imply the logical relation among
them. In this type of file it is assumed that the items are filed in a
linear sequence or in a hierarchical tree under the last row of headings.

More  than  one  way  of  searching  the  nodes  of  a  tree  will  be  used. 
Further  subdivisions  of  the  three  types  of  file  organizations  will  be 
discussed  in  the  following  detailed  analysis.  Trees  of  both  items  and 
subject  headings  will  be  considered,  in  various  cases,  in  the  section  on 
hierarchical  trees.  First,  however,  single  level  subject  headings  will 
be analyzed. This analysis will include the case of a sequentially
ordered file which, when searched logarithmically, makes the transition
between single level subject headings and hierarchical trees one of
generalizing a special case.

For  each  type  of  file  structure  a  mathematical  expression  can 
be  derived  for  the  expected  number  of  headings  and  items  searched  and 
examined in order to locate a single item in the file. Some simplifying
assumptions will be made to keep the mathematics relatively
uncomplicated. Similar expressions can be derived, however, under less
restrictive assumptions.


4.4.1.2 Single Level Subject Headings - Suppose there are s
subject headings. It is assumed that the subject heading under which
the item is to be found is supplied with the request. It is further
assumed for the sake of simplicity that the items in the file are evenly
distributed  under  the  subject  headings.  That  is,  it  is  equally  likely 
that  any  subject  heading  and  any  item  under  a  subject  heading  will  be 
requested  and  each  subject  heading  will  have  the  same  number  of  items 
filed under it. The probability $p_1$ of searching one subject heading is:

$$p_1 = \frac{1}{s} \qquad \text{(4-68)}$$

The probability of searching two subject headings to find the requested
one is:

$$p_2 = \frac{1}{s} \qquad \text{(4-69)}$$

Similarly:

$$p_i = \frac{1}{s} \qquad \text{(4-70)}$$

The expected number E(i) of subject headings searched is:

$$E(i) = \sum_{i=1}^{s} \frac{i}{s} = \frac{1}{s}\cdot\frac{s(s+1)}{2} \qquad \text{(4-71)}$$

or

$$E(i) = \frac{s+1}{2} \qquad \text{(4-72)}$$

The number of items $N_s$ under each subject heading is:

$$N_s = \frac{N}{s} \qquad \text{(4-73)}$$

By an argument analogous to that for subject headings, the expected number
E(i) of items searched is:

$$E(i) = \sum_{i=1}^{N_s} \frac{i}{N_s} = \frac{N_s + 1}{2} = \frac{N + s}{2s} \qquad \text{(4-74)}$$

The expected number of items and subject headings searched in
a linear file is then:

$$E = \frac{s+1}{2} + \frac{N+s}{2s} = \frac{1}{2}\left(s + \frac{N}{s} + 2\right) \qquad \text{(4-75)}$$
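
The expected value for the linear file, E = (s + N/s + 2)/2, can be checked numerically. The sketch below compares that expression with a direct enumeration over every equally likely request; the function names are illustrative, and the enumeration assumes that s divides N exactly so that every heading holds the same number of items.

```python
def expected_formula(s, n_items):
    """Closed-form expected number of headings plus items examined."""
    return 0.5 * (s + n_items / s + 2)


def expected_by_enumeration(s, n_items):
    """Average cost over every (heading, item) request, all equally likely."""
    per_heading = n_items // s  # assumes s divides n_items exactly
    total = 0
    for heading in range(1, s + 1):
        for item in range(1, per_heading + 1):
            # Scan `heading` headings, then `item` items under the match.
            total += heading + item
    return total / (s * per_heading)


formula_value = expected_formula(10, 100)
enumerated_value = expected_by_enumeration(10, 100)
```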


A file of items arranged sequentially by some ordering rule
(e.g., a file of part or drawing numbers or any other numbered or ordered
items) can be arranged and searched by the method of subject headings
previously described. Another method of search is the following: Go
to the middle of the file. Compare the item requested with the item
there. A decision can then be made on the basis of the ordering of the
items as to whether the item sought is in the first (lower) half of the
file or in the second (higher) half. Whichever half it is in, go to
the middle of that half and repeat the procedure. This process is
continued until the item is located. The process of going to the middle of
any portion of the file will be called a cut. Since a single file item
is examined for each cut, the expected number of cuts is equal to the
expected number of file items which will be examined. This method is
called the binary logarithmic search.
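
The procedure just described can be sketched as follows, with each visit to the middle of the remaining portion counted as one cut. The function name `search_with_cuts` and the representation of the file as a sorted Python list are assumptions made for this example.

```python
def search_with_cuts(items, target):
    """Binary search over a sorted list; returns (index or None, cuts)."""
    low, high = 0, len(items) - 1
    cuts = 0
    while low <= high:
        middle = (low + high) // 2
        cuts += 1  # one file item is examined per cut
        if items[middle] == target:
            return middle, cuts
        if items[middle] < target:
            low = middle + 1   # continue in the higher half
        else:
            high = middle - 1  # continue in the lower half
    return None, cuts


# Hypothetical ordered file of seven part numbers.
parts = [1, 2, 3, 4, 5, 6, 7]
index_mid, cuts_mid = search_with_cuts(parts, 4)
index_first, cuts_first = search_with_cuts(parts, 1)
```

In this seven-item file the middle item is found on the first cut, and no item needs more than three cuts.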

Consider a file of N items. By the search procedure just
described, the number of items $N_j$ that can possibly be retrieved on the
first cut is 1; on the second cut, 2; and in general on the j-th cut:

$$N_j = 2^{j-1} \qquad \text{(4-76)}$$

The maximum number of cuts n required to retrieve any item whatsoever in
the file can be determined from Equation 4-76 as follows:

$$N = \sum_{j=1}^{n} N_j = \sum_{j=1}^{n} 2^{j-1} = 2^n - 1 \qquad \text{(4-77)}$$

Solving Equation 4-77 for n gives:

$$n = \log_2(N + 1) \qquad \text{(4-78)}$$

The origin of the name logarithmic search is obvious from Equation 4-78.

It is evident from Equation 4-76 that the probability of
retrieving the correct item in response to a given random request on the
j-th cut is:

$$p_j = \frac{2^{j-1}}{N} \qquad \text{(4-79)}$$

The expression for the expected number of cuts E (or, equivalently, the
number of items examined) is:

$$E = \sum_{j=1}^{n} j\,\frac{2^{j-1}}{N} \qquad \text{(4-80)}$$

where n is obtained from Equation 4-78. The series in Equation 4-80 is
the derivative of a geometric progression, and the expression for its
sum can be obtained by differentiating the expression for the sum of a
geometric progression with a finite number of terms. This procedure
yields the following expression for E:

$$E = \left[\frac{N+1}{N}\right]\log_2(N + 1) - 1 \qquad \text{(4-81)}$$
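
As a numerical check on the closed form E = [(N + 1)/N] log2(N + 1) - 1, the sketch below evaluates the series term by term and compares the two. The function names are illustrative, and the check assumes a file size of the form N = 2^n - 1 so that every cut level is completely filled.

```python
import math


def expected_cuts_by_sum(n_items):
    """Sum j * 2**(j-1) / N over the n cut levels of a full file."""
    n = round(math.log2(n_items + 1))  # exact when N = 2**n - 1
    return sum(j * 2 ** (j - 1) for j in range(1, n + 1)) / n_items


def expected_cuts_closed_form(n_items):
    """Closed-form expected number of cuts for a full file."""
    return (n_items + 1) / n_items * math.log2(n_items + 1) - 1


small_sum = expected_cuts_by_sum(7)
small_closed = expected_cuts_closed_form(7)
```

For N = 7 both expressions give 17/7, or about 2.43 cuts, against a maximum of 3.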


4.4.1.3 Hierarchical Trees - Only regular rooted trees will be
considered for hierarchical trees. A tree is rooted if all its branches
are connected ultimately to a single node (the root). A tree is regular
if the number of branches k emanating from each node is a constant.
Another way of thinking of this file structure is that every heading or
grouping of the file organization is divided into the same number of
subheadings.

Four cases of retrieving items from trees will be considered.
These cases are designated I to IV, respectively.

4.4.1.3.1 Case I - In this case the tree is considered as a
hierarchy  composed  entirely  of  file  items,  each  of  which  is  equally 
likely  to  be  the  answer  to  a  given  random  request.  Hence,  retrieving 
a  given  node  will  be  considered  as  providing  a  single-item  response. 

The  level  of  the  node  then  represents  the  generality  of  the  response, 
which  is  presumably  related  directly  to  the  generality  of  the  request. 
The  node  provided  as  a  response  can  be  considered  as  the  name  or  term 
or  descriptor  for  all  the  nodes  at  lower  levels  of  the  tree  that  are 
connected  to  the  node  provided  as  a  response.  If  the  node  is  a  category 
name, all the connected nodes (the items in the category) could be
provided as part of the response. It is assumed that the tree is indexed;
that is, each node of the tree contains indexes of the nodes on the next
lower level connected to it. It is also assumed that these indexes are
sufficient  to  ascertain  which  node  to  examine  at  the  next  level.  Thus 
only  one  node  is  examined  at  each  level  searched. 

If each node of the tree contains indexes that are identifiers
of the nodes at the next level at the end of the branches emanating from
it, then by examining a given node a decision can be made as to which
node to examine at the next level. Searching a tree of this type is a
generalisation of the binary logarithmic search. For example, consider
a regular binary tree; that is, k = 2. Examining the first node, the
root, is analogous to going to the middle of the file. There are two
nodes at the next level. Selecting one is analogous to going to the
middle of the lower half of the file; selecting the other is equivalent
to going to the middle of the upper half of the file. The generalization
of this process for larger integral values of k is obvious. The
mathematics is analogous to the binary logarithmic search.

The number of levels L to be examined in order to guarantee the
retrieval of any item in a regular tree of order k is:

$$ L = \log_k[(k-1)N + 1] $$    (4-82)

The expected number of items examined becomes:

$$ E = \frac{1}{N}\sum_{j=1}^{L} j\,k^{j-1} = \left[1 + \frac{1}{(k-1)N}\right]\log_k[(k-1)N + 1] - \frac{1}{k-1} $$    (4-83)

where L is determined from Equation 4-82. Thus Equations 4-78 and 4-81
are merely special cases of Equations 4-82 and 4-83, respectively, for
regular binary trees.
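
As a quick numerical check of Equations 4-82 and 4-83, the following Python sketch compares the direct summation with the closed form. It assumes a full regular tree, so that N = (k^L - 1)/(k - 1) exactly; the function names are illustrative and do not come from the report.

```python
from fractions import Fraction

def full_tree_size(k, levels):
    """Number of nodes in a full regular tree of order k with the given levels."""
    return (k**levels - 1) // (k - 1)

def levels_for(k, n_items):
    """Equation 4-82 for whole levels: smallest L whose full tree holds n_items."""
    levels = 1
    while full_tree_size(k, levels) < n_items:
        levels += 1
    return levels

def expected_examined_sum(k, levels):
    """Summation in Equation 4-83: (1/N) * sum over j of j * k**(j-1)."""
    n_items = full_tree_size(k, levels)
    return Fraction(sum(j * k**(j - 1) for j in range(1, levels + 1)), n_items)

def expected_examined_closed(k, levels):
    """Closed form of Equation 4-83, using log_k[(k-1)N + 1] = L for a full tree."""
    n_items = full_tree_size(k, levels)
    return (1 + Fraction(1, (k - 1) * n_items)) * levels - Fraction(1, k - 1)
```

For a binary tree with three levels (N = 7), both forms give 17/7, which is the familiar binary-search average.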

4.4.1.3.2 Case II - In this case only the nodes at the bottom
level of the tree represent file items. It is assumed that each such
node represents a group of file items. Thus a search consists of tracing
a path through the tree to one node at the bottom and searching the items
filed under that node to provide a single file item as a response. Again,
it is assumed that each node is equally likely to be the answer. If this
case is restricted to regular trees with no method of indexing or determining
which connected node at the next level is the correct one, then
this case generalizes the simple subject heading file to a multi-level
subject heading or classification file. Only non-indexed trees will be
considered in this case. A non-indexed tree is one that has no mechanism
for selecting the proper node at the next lower level without examining
the nodes at that level connected to the node at which the searcher is
presently located.

Assume there are s nodes or subject headings on a regular tree
of order k. Then let there be N file items listed under the bottom nodes
and assume that the file items are evenly distributed among these nodes.
Assume also that there are L levels of nodes in the tree.

Since the only nodes searched at each level are those connected
to the node selected at the next higher level, the probability p_j of
finding the desired subject heading at a given node is:

$$ p_j = \frac{1}{k} $$    (4-84)

Therefore, the expected number of nodes examined at any level j, except
the first level or the root node,* where the expected number is 1, is:

$$ E_j(i) = \frac{k+1}{2} $$    (4-85)

where 2 ≤ j ≤ L. Hence, the expected number of nodes examined for the
entire tree including the root node is:

$$ E_s = \frac{k+1}{2}(L - 1) + 1 $$    (4-86)

*It is assumed that this node is examined to identify the tree and locate
the nodes at the second level.

The required number of levels L in the tree is determined by k and s, and
is obtained from Equation 4-82, which gives:

$$ L = \log_k[(k-1)s + 1] $$    (4-87)

Substituting Equation 4-87 into Equation 4-86 and simplifying:

$$ E_s = \frac{k+1}{2}\log_k[(k-1)s + 1] + \frac{1-k}{2} $$    (4-88)


At this stage, no file items have been examined. Equation 4-88
gives the expected number of subject headings examined to find the heading
at the lowest level under which the file item sought is listed.
Therefore, the file items under that heading must now be examined. The
number of items N_s filed under a given subject heading is:

$$ N_s = \frac{N}{s_L} $$    (4-89)

where s_L is the number of subject headings, or nodes, at the lowest level
of the tree. This sequence is a simple linear file like the first one
examined. The expected number of file items searched E_N is then:

$$ E_N(i) = \sum_{i=1}^{N_s} \frac{i}{N_s} = \frac{N_s + 1}{2} $$    (4-90)

The number of nodes s_j at level j of a regular tree of order k is given
by:

$$ s_j = k^{j-1} $$    (4-91)

therefore:

$$ s_L = k^{L-1} $$    (4-92)

Substituting Equation 4-87 into Equation 4-92 yields:

$$ s_L = \frac{(k-1)s + 1}{k} $$    (4-93)

and from Equations 4-89 and 4-93:

$$ N_s = \frac{kN}{(k-1)s + 1} $$    (4-94)

Substituting Equation 4-94 in Equation 4-90 gives:

$$ E_N = \frac{kN + (k-1)s + 1}{2[(k-1)s + 1]} $$    (4-95)

The expected value of the number of subject headings and file
items examined to retrieve one file item in this type of file organization
is Equation 4-88 plus Equation 4-95:

$$ E = \frac{kN + (k-1)s + 1}{2[(k-1)s + 1]} + \frac{k+1}{2}\log_k[(k-1)s + 1] + \frac{1-k}{2} $$    (4-96)

It is now evident that when file items are related it may be
possible to arrange each set of N_s items so that it can be searched
logarithmically. In this case Equation 4-96 becomes:

$$ E = \left[1 + \frac{1}{(k-1)N_s}\right]\log_k[(k-1)N_s + 1] - \frac{1}{k-1} + \frac{k+1}{2}\log_k[(k-1)s + 1] + \frac{1-k}{2} $$    (4-97)

Equation 4-97 is obtained from Equations 4-83, 4-88, and 4-89. Equation
4-93 was used to obtain the value of s_L.
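
The Case II expressions can be evaluated with a short Python sketch. It treats s and N as real numbers, takes L from Equation 4-87 and N_s from Equation 4-94, and then adds the heading cost of Equation 4-88 to either the sequential item cost (Equation 4-96) or the logarithmic item cost (Equation 4-97). The function names are hypothetical, chosen only for this illustration.

```python
import math

def levels_of_headings(k, s):
    """Equation 4-87: L = log_k[(k-1)s + 1]."""
    return math.log((k - 1) * s + 1, k)

def items_per_heading(k, s, n_items):
    """Equation 4-94: N_s = kN / [(k-1)s + 1]."""
    return k * n_items / ((k - 1) * s + 1)

def heading_cost(k, s):
    """Equation 4-88: expected headings examined in a non-indexed tree."""
    return (k + 1) / 2 * levels_of_headings(k, s) + (1 - k) / 2

def expected_sequential(k, s, n_items):
    """Equation 4-96: heading cost plus a sequential scan of N_s items."""
    return heading_cost(k, s) + (items_per_heading(k, s, n_items) + 1) / 2

def expected_logarithmic(k, s, n_items):
    """Equation 4-97: heading cost plus Equation 4-83 applied to N_s items."""
    n_s = items_per_heading(k, s, n_items)
    items = (1 + 1 / ((k - 1) * n_s)) * math.log((k - 1) * n_s + 1, k) - 1 / (k - 1)
    return heading_cost(k, s) + items
```

With k = 2, s = 3, and N = 8 the tree has two levels and four items under each bottom heading, so the sequential form gives 2.5 headings plus 2.5 items.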
4.4.1.3.3 Case III - This case is the same as Case I except
that the tree is not indexed. That is, any node may be a satisfactory
response to a request; but after selecting a node at a given level, it
is necessary to examine the nodes at the next lower level connected to
the selected node in order to ascertain which one is the next appropriate
subheading.

In this case the maximum number of nodes examined at each level
except the first is simply k. The number of nodes examined at the first
level is 1. Therefore, the maximum number of nodes examined in any search
is:

$$ n = k(L - 1) + 1 $$    (4-98)

hence, from Equations 4-82 and 4-98:

$$ n = k\log_k[(k-1)N + 1] + (1 - k) $$    (4-99)

Therefore, the expected number of nodes examined is:

$$ E = \sum_{i=1}^{n} \frac{i}{n} = \frac{k}{2}\log_k[(k-1)N + 1] + \frac{2-k}{2} $$    (4-100)

where n is determined from Equation 4-99.
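
A minimal Python sketch of the Case III quantities follows. It simply evaluates Equations 4-99 and 4-100 under the report's equally likely assumption; the function names are illustrative only.

```python
import math

def max_nodes_examined(k, n_items):
    """Equation 4-99: n = k * log_k[(k-1)N + 1] + (1 - k)."""
    return k * math.log((k - 1) * n_items + 1, k) + (1 - k)

def expected_nodes_examined(k, n_items):
    """Equation 4-100: the mean of 1..n, that is (n + 1)/2."""
    return (max_nodes_examined(k, n_items) + 1) / 2
```

For k = 10 and N = 111 (three levels) at most 21 nodes are examined and the expected number is 11.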


4.4.1.3.4 Case IV - This case considers an indexed tree of
subject headings rather than file items, with the file items located under
the lowest row of nodes or subject headings. The equally likely assumption
is involved, as usual. Two variations can be considered. First,
the file items are sequential and searched in order. Second, the file
items are searched logarithmically; in this variation the items are
actually filed in a tree structure.

Since the subject headings in this case are not responses, the
expected number of headings examined is fixed and equal to the number of
levels L in the tree. Therefore, from Equation 4-87:

$$ E_s = \log_k[(k-1)s + 1] $$    (4-101)

For a sequentially searched file, the expected number of items searched
is obtained from Equation 4-95. Therefore, the expected number of subject
headings and items searched is:

$$ E = \frac{kN + (k-1)s + 1}{2[(k-1)s + 1]} + \log_k[(k-1)s + 1] $$    (4-102)

If the items are searched logarithmically, the expected number
is obtained by taking N equal to N_s and then substituting Equation 4-94
in Equation 4-83. The resulting equation is:

$$ E_N = \left[1 + \frac{(k-1)s + 1}{k(k-1)N}\right]\log_k\left[\frac{k(k-1)N}{(k-1)s + 1} + 1\right] - \frac{1}{k-1} $$    (4-103)

Therefore, the expected number of subject headings and items examined
is Equation 4-101 plus Equation 4-103:

$$ E = \log_k[(k-1)s + 1] + \left[1 + \frac{(k-1)s + 1}{k(k-1)N}\right]\log_k\left[\frac{k(k-1)N}{(k-1)s + 1} + 1\right] - \frac{1}{k-1} $$    (4-104)
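
The two Case IV variations can be sketched in Python as follows. The code evaluates Equation 4-102 (sequential items) and Equation 4-104 (logarithmically searched items) with N_s taken from Equation 4-94; the function names and the intermediate variable `width` are assumptions of this sketch.

```python
import math

def expected_indexed_sequential(k, s, n_items):
    """Equation 4-102: L heading levels plus a sequential scan of N_s items."""
    width = (k - 1) * s + 1
    return (k * n_items + width) / (2 * width) + math.log(width, k)

def expected_indexed_logarithmic(k, s, n_items):
    """Equation 4-104: L heading levels plus Equation 4-83 applied to N_s items."""
    width = (k - 1) * s + 1
    n_s = k * n_items / width  # Equation 4-94
    items = (1 + 1 / ((k - 1) * n_s)) * math.log((k - 1) * n_s + 1, k) - 1 / (k - 1)
    return math.log(width, k) + items
```

With k = 2, s = 3, and N = 8 the sequential variation examines 2 headings and 2.5 items on average.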


4.4.1.4 Analysis and Comparison of the Expected Values - The
major purpose of deriving expressions for the expected values of the
number of headings and items examined in various file structures is that
these values provide a convenient (if oversimplified) means of comparing
the effectiveness of different file structures. These file organizations
and their corresponding average values are summarized in Table 2.

For general purposes of comparison the equations identified in
Table 2 can be rewritten in simpler form. The simplified versions are
given below with their original numbers followed by "A". The subscript
s stands for subject headings; N for file items.

$$ E = \tfrac{1}{2}[s + N_s + 2] = \frac{s+1}{2} + \frac{N_s + 1}{2} $$    (4-75A)

where N_s is obtained from Equation 4-73.

$$ E \approx L_N - \frac{1}{k-1} $$    (4-83A)

where L_N = n is obtained from Equation 4-82.

$$ E = \frac{k+1}{2}(L_s - 1) + 1 + \frac{N_s + 1}{2} $$    (4-96A)

where L_s is obtained from Equation 4-87; N_s, from Equation 4-94.

$$ E \approx \frac{k+1}{2}(L_s - 1) + 1 + L_{N_s} - \frac{1}{k-1} $$    (4-97A)

where L_s and L_{N_s} are obtained from Equation 4-87; N_s, from Equation 4-94.

$$ E = \frac{k}{2}(L_N - 1) + 1 $$    (4-100A)

where L_N = n is obtained from Equation 4-82.

$$ E = L_s + \frac{N_s + 1}{2} $$    (4-102A)

where L_s is obtained from Equation 4-87; N_s, from Equation 4-94.

$$ E \approx L_s + L_{N_s} - \frac{1}{k-1} $$    (4-104A)

where L_s and L_{N_s} are obtained from Equation 4-87; N_s, from Equation 4-94.
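
To see how much is lost in one of these simplifications, the sketch below compares Equation 4-83 with its simplified form 4-83A. It assumes that 4-83A is obtained by dropping the factor 1 + 1/[(k-1)N], so the two differ by L_N/[(k-1)N]; the function names are illustrative.

```python
import math

def e_83_exact(k, n_items):
    """Equation 4-83: indexed tree of items."""
    levels = math.log((k - 1) * n_items + 1, k)
    return (1 + 1 / ((k - 1) * n_items)) * levels - 1 / (k - 1)

def e_83a(k, n_items):
    """Equation 4-83A: the simplified form L_N - 1/(k-1)."""
    return math.log((k - 1) * n_items + 1, k) - 1 / (k - 1)
```

For k = 10 and N = 111,111 the difference is 6/999,999, which is negligible in the expected value itself.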

These  equations  can  be  analyzed  in  two  major  ways  with  respect 
to  E.  The  first  is  to  ascertain  within  a  given  equation  whether  there 
is  a  relationship  between  s  and  N  that  will  minimize  E  for  that  type  of 
file  organization.  The  second  is  to  compare  the  equations  with  each 
other  to  determine  whether  some  file  structures  are  always  superior  to 
others. 


To carry out the first analysis it is sufficient to assume that
s can take any positive real value, to differentiate each of the equations
with respect to s, considering N as a constant, and to check
whether the resulting extremum is indeed a minimum. If there is such a
relationship between s and N, it provides the proper number of subject
headings s to minimize E for a file of N items with that type of
organization.


*In the following discussion the values of s, which optimize the expected
number of headings and items examined, are obtained for several of the
file organizations. This derivation is accomplished by differentiating
the expression for E with respect to s to obtain the appropriate s as a
function of N that minimizes E. Strictly speaking, such a procedure is
not permissible because all the distributions considered are discrete.
E is defined only for positive integral values of s and N. Nevertheless,
the equations for E in all cases are continuous functions for the domains
of k, s, and N that are of interest. Consequently, these differentiations
can be carried out formally and the relative minima obtained. To
obtain the integral values of s that minimize E, it is then necessary to
substitute the two integers closest to the minimum s into the equation
for E to ascertain which gives the smaller E. This integer is then used
as the minimum, provided it is positive. Even this procedure would not

For example, taking the partial derivative of E with respect
to s in Equation 4-75A and setting the result equal to zero yields:

$$ s = \sqrt{N} $$    (4-105)

A check reveals that the appropriate conditions for a minimum are satisfied.
That is, the value of s given in Equation 4-105 will always result in a
minimum E for that N. Substituting Equation 4-105 in Equation 4-75A gives:

$$ E_{min} = 1 + \sqrt{N} $$    (4-106)

From Equations 4-73 and 4-105, the optimum value for N_s is:

$$ N_s = \sqrt{N} $$    (4-107)

Equation 4-83A cannot be treated in this manner because it is a function
of N only (and k). It is true, however, that as k increases, E decreases.
Care must be taken in the interpretation of this result.
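
The integer rounding step described in the footnote can be sketched in Python for the single-level file. The code assumes N_s = N/s in Equation 4-75A and tries the two integers nearest the continuous optimum of Equation 4-105; the function names are hypothetical.

```python
import math

def e_75a(s, n_items):
    """Equation 4-75A with N_s = N/s."""
    return 0.5 * (s + n_items / s + 2)

def best_integer_s(n_items):
    """Try the two integers nearest sqrt(N) (Equation 4-105) and keep the better."""
    root = math.sqrt(n_items)
    candidates = {max(1, math.floor(root)), max(1, math.ceil(root))}
    return min(candidates, key=lambda s: e_75a(s, n_items))
```

For N = 100 the result is s = 10 with E = 11, in agreement with Equation 4-106; for N = 1000 the choice between 31 and 32 falls on 32.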


Application of the same method to Equation 4-96A yields:

$$ s = \frac{1}{k-1}\left[\frac{kN}{(k+1)\log_k e} - 1\right] $$    (4-108)

This value of s for any N will yield the minimum E in Equation 4-96A.
The value of E is:

$$ E_{min} = \frac{3}{2} + \frac{k+1}{2}\log_k\left[\frac{eN}{(k+1)\log_k e}\right] $$    (4-109)


be  sufficient  were  it  not  for  the  fact  that  these  functions,  in  the  cases 
considered,  have  only  one  relative  minimum,  and,  therefore,  this  relative 
minimum is also an absolute minimum. The ultimate justification for these
unrigorous techniques is that they do provide the real minima and, therefore,
have considerable utility.
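
The optimum for the non-indexed tree of headings can be checked numerically. The sketch below evaluates Equation 4-96A for real-valued s, using Equation 4-87 for L_s and Equation 4-94 for N_s, and compares it with the closed forms of Equations 4-108 and 4-109. The function names are illustrative and not from the report.

```python
import math

def e_96a(k, s, n_items):
    """Equation 4-96A with L_s from 4-87 and N_s from 4-94; s is continuous."""
    width = (k - 1) * s + 1
    levels = math.log(width, k)
    n_s = k * n_items / width
    return (k + 1) / 2 * (levels - 1) + 1 + (n_s + 1) / 2

def optimum_s_96a(k, n_items):
    """Equation 4-108."""
    log_k_e = 1 / math.log(k)
    return (k * n_items / ((k + 1) * log_k_e) - 1) / (k - 1)

def e_min_96a(k, n_items):
    """Equation 4-109."""
    log_k_e = 1 / math.log(k)
    return 1.5 + (k + 1) / 2 * math.log(math.e * n_items / ((k + 1) * log_k_e), k)
```

Evaluating `e_96a` at the s returned by `optimum_s_96a` reproduces `e_min_96a`, and moving s away from that value in either direction increases E.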


Equation 4-97A has no relative minimum. However, the optimum
value for s can be obtained by observation. By substituting Equations
4-87 and 4-94 in 4-97A and simplifying, the result obtained is:

$$ E \approx \frac{k-1}{2}\left\{\log_k[(k-1)s + 1] - 1\right\} + \log_k[k(k-1)N + (k-1)s + 1] - \frac{1}{k-1} $$    (4-97B)

This equation is defined for s ≥ 1. For this range of s, Equation 4-97B
has a minimum at s = 1. This minimum gives for E:

$$ E \approx 1 + L_N - \frac{1}{k-1} $$

The single subject heading is superfluous and can be eliminated. The
minimum E becomes:

$$ E_{min} \approx L_N - \frac{1}{k-1} $$    (4-110)

Therefore, the optimum s for Equation 4-97A is zero, and the equation has
been reduced to Equation 4-83A. Consequently, it is disadvantageous to
superimpose a non-indexed tree of subject headings on an indexed tree of
file items.

Equation 4-100A is a function of N and k only; again, as k
increases, E decreases.

For Equation 4-102A the s that gives minimum E is:

$$ s = \frac{1}{k-1}\left[\frac{kN}{2\log_k e} - 1\right] $$    (4-111)

The minimum E becomes:

$$ E_{min} = \frac{3}{2} + \log_k\left[\frac{eN}{2\log_k e}\right] $$    (4-112)

Equation 4-104A has no relative minimum. However, the optimum
value for s can be obtained as follows. By substituting Equations 4-87
and 4-94 in Equation 4-104A and simplifying, it becomes:

$$ E \approx \log_k[k(k-1)N + (k-1)s + 1] - \frac{1}{k-1} $$    (4-104B)

This equation is defined for s ≥ 1. Obviously, it has an absolute minimum
at s = 1, which gives:

$$ E \approx 1 + L_N - \frac{1}{k-1} $$

The single subject heading again is superfluous, and E becomes:

$$ E_{min} \approx L_N - \frac{1}{k-1} $$    (4-113)

Thus the optimum s for Equation 4-104A is zero, and this equation is also
reduced to Equation 4-83A.* In other words, wherever it is possible to
construct an indexed tree of items, it is pointless to superimpose an
indexed tree of subject headings upon it. It is also pointless to
establish any other system of subject headings. One example, namely
Equation 4-97A, has already been considered.
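
The claim that Equation 4-104B is smallest at s = 1 can be illustrated with a few lines of Python. The sketch evaluates the expression for increasing s at fixed k and N; the function name is an assumption of this example.

```python
import math

def e_104b(k, s, n_items):
    """Equation 4-104B: log_k[k(k-1)N + (k-1)s + 1] - 1/(k-1)."""
    return math.log(k * (k - 1) * n_items + (k - 1) * s + 1, k) - 1 / (k - 1)

# E for s = 1, 2, ..., 49 with k = 10 and N = 111 (a three-level tree of items).
values = [e_104b(10, s, 111) for s in range(1, 50)]
```

The list `values` is strictly increasing, and its first entry equals 1 + L_N - 1/(k-1), that is, one superfluous heading on top of the indexed tree of items.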


The second type of analysis compares one equation with another
for an arbitrary but specified file size N and for a number of headings
s; the objective is to determine whether E is always less in one type
of file organization than in another. Equations 4-97A and 4-104A have
been shown to be superfluous and will not be considered.

The files with no subject headings, Equations 4-83A and 4-100A,
will be considered first. For a given N, Equation 4-83A will yield a
lower average number of items searched than Equation 4-100A if:

$$ L_N - \frac{1}{k-1} < \frac{k}{2}(L_N - 1) + 1 $$    (4-114)

This inequality can be written:

$$ (L_N - 1) - \frac{1}{k-1} < \frac{k}{2}(L_N - 1) $$

The inequality is clearly valid for k ≥ 2. Consequently, the average
number of items examined in searching an indexed tree of N items is
always less than the average number examined in a non-indexed tree.
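
A brute-force check of Inequality 4-114 over a grid of k and L_N is easy to write. The sketch below assumes whole numbers of levels; the function name is illustrative.

```python
def indexed_beats_non_indexed(k, levels):
    """Inequality 4-114 with L_N = levels."""
    indexed = levels - 1 / (k - 1)            # Equation 4-83A
    non_indexed = k / 2 * (levels - 1) + 1    # Equation 4-100A
    return indexed < non_indexed

# True for every order from 2 to 100 and every depth from 1 to 20 levels.
always_holds = all(
    indexed_beats_non_indexed(k, levels)
    for k in range(2, 101)
    for levels in range(1, 21)
)
```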


For the case where the number of headings in both trees is the
same, Equations 4-96A and 4-102A can be compared in terms of:

$$ \frac{k+1}{2}(L_s - 1) + 1 > L_s $$

or

$$ \frac{k+1}{2}(L_s - 1) > L_s - 1 $$    (4-115)

This inequality is clearly valid for k ≥ 2 and L_s > 1. Therefore, Equation
4-102A gives a smaller E than Equation 4-96A. It is clear, however,
from Equations 4-108 and 4-111 that the optimum s's for the two trees of
Equations 4-102A and 4-96A are not identical. Nevertheless, it can be
shown directly from Equations 4-109 and 4-112 that Equation 4-102A also
yields a smaller E than Equation 4-96A when s is optimized in each case.
This optimization would require:

$$ \log_k\left[\frac{eN}{2\log_k e}\right] < \frac{k+1}{2}\log_k\left[\frac{eN}{(k+1)\log_k e}\right] $$

or

$$ \frac{eN}{2\log_k e} < \left[\frac{eN}{(k+1)\log_k e}\right]^{(k+1)/2} $$    (4-116)

This inequality is valid for:

$$ N \ge \frac{(k+1)^{(k+1)/(k-1)}}{2^{2/(k-1)}} \cdot \frac{\log_k e}{e} $$    (4-117)

This condition presents no restriction for a practical case. For example,
Equation 4-117 requires N ≥ 4, if k = 2; N ≥ 3, if k = 10; N ≥ 6, if
k = 100.


For a given N and a given s > 1, Equation 4-102A always gives a
lower value of E than Equation 4-75A. The conditions would require:

$$ L_s < \frac{s+1}{2} $$

This inequality can be transformed by algebra to:

$$ k^{-(s+1)/2}[(k-1)s + 1] < 1 $$    (4-118)

By differentiating the left member of Equation 4-118 with respect to k
and setting it equal to zero, a value for k can be obtained to make it
an extremum. This value is:

$$ k = \frac{s+1}{s} $$    (4-119)

By examining the second derivative at this point, it is observed that
Equation 4-119 maximizes the left member of Equation 4-118 when s > 1.


This maximum value is:

$$ 2\left[\frac{s}{s+1}\right]^{(s+1)/2} $$    (4-120)

For s > 1, the Value 4-120 is always less than 1. Since the maximum
value satisfies Equation 4-118, any other value, in particular any
k ≥ 2, will also satisfy it.


When s is optimized in each case, these two file structures can
be compared by Equations 4-106 and 4-112. Equation 4-102A will give a
lower E than Equation 4-75A in the optimum case when:

$$ \frac{3}{2} + \log_k\left[\frac{eN}{2\log_k e}\right] < 1 + \sqrt{N} $$

By algebraic transformations, this inequality can be written:

$$ \frac{N \ln k}{k^{\sqrt{N} - 1/2}} < \frac{2}{e} $$    (4-121)

When k = 2, this inequality is valid for N ≥ 27; when k = 4, it is valid
for N ≥ 4; when k ≥ 6, it holds for N ≥ 1.

The optimum cases of Equations 4-96A and 4-75A can be compared
by using Equations 4-106 and 4-109. Equation 4-96A will yield a smaller
E when:

$$ 1 + \sqrt{N} > \frac{3}{2} + \frac{k+1}{2}\log_k\left[\frac{eN}{(k+1)\log_k e}\right] $$

that is, when:

$$ \frac{N \ln k}{(k+1)\,k^{(2\sqrt{N} - 1)/(k+1)}} < \frac{1}{e} $$    (4-122)

Equation 4-122 is generally valid for larger files. For example, a
simple calculation with k = 10 shows that Equation 4-122 is valid for N
roughly greater than 115 and invalid for smaller N. Hence, the single
level subject heading file results in a smaller average number of items
searched in files with fewer than 115 items. This conclusion is shown
clearly in Figure 1.

Figure 1 depicts the average number of headings and items
examined for a wide range of file sizes. Only optimum values for s
are shown. The figure indicates the superiority of indexed trees over
non-indexed trees and of non-indexed trees over single-level subject headings,
except for small files as indicated by Equation 4-122. However,
the degree of superiority of the indexed trees is somewhat misleading.
Although it is true that the average number of headings and items examined
or searched for such trees is much smaller than for the other file structures,
this fact does not imply much faster response times. By omitting
consideration of the indexing function itself, the burden of search has
in a sense merely been shifted elsewhere. Unless the indexing function
is powerful, the search procedure in an indexed tree, particularly where
k is large, may spend almost as much time examining indexes to determine
the appropriate paths as would be involved in examining the headings
themselves.

FIGURE 1. Average Number of Headings and Items Examined in a Search
of Differently Organized Files

A singular feature of Figure 1 is that the indexed tree of
items, Equation 4-83A, and the indexed tree of headings, Equation 4-102A,
give similar values of E. The same is true for the non-indexed trees
represented by Equations 4-100A and 4-96A. The explanation, however, is
simple. Equations 4-108 and 4-111 require that the number of subject
headings should be so large that essentially only a few items or even a
single item are filed sequentially under each node of the last row. In
other words, N_s is small. This fact can be seen from the values of N_s
derived from Equations 4-94, 4-108, and 4-111, respectively. These
values are:

$$ N_s = (k+1)\log_k e $$    (4-123A)

$$ N_s = \begin{cases} 2\log_k e & (k \le 7) \\ 1 & (k > 7) \end{cases} $$    (4-123B)

Consequently, almost all the searching is performed in the tree of headings
where it is most economical. Hence, the close correspondence arises
between trees of headings and between trees of items. Of course, in
practice, it may frequently be impossible to achieve a meaningful breakdown
of related headings to such a detailed level. Therefore, the
optimum values of s, N_s, and E should be regarded as interesting idealizations.
In practice, only integral values of s and N_s can be used.

In cases where the optimum curves plotted in Figure 1 are
unrealistic because they restrict s too much, the equations developed
in this and the previous section can be used to generate complete sets
of design charts. From these charts the best file organization can be
read, in terms of whatever value s must have to reflect the logical relationships
and the nature of the subject matter to be classified.


In the interest of completeness, Figure 2 is included for reference.
It gives the number of levels of nodes required in a regular tree of
order k to accommodate N items, one item per node. Figure 2 is obtained
from Equation 4-82 or 4-87.

FIGURE 2. Number of Levels Required to Store N Items in a Regular Tree

4.4.1.5 Variance From the Expected Values - The utility of the
average or expected number of items and headings examined in different
file structures depends upon the likelihood that the number of items and
headings searched will generally be near the average value. An estimate
of this likelihood is provided by the statistical variance of the number
of items and headings searched from the average number. Expressions for
the variance relative to Equations 4-75A, 4-83,* 4-96A, 4-100A, and
4-102A will be developed and analyzed.


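Directly from the definition, the variance σ² of the single
level subject heading file can be written:

$$ \sigma^2 = \sum_{i=1}^{s} \frac{1}{s}\left[i - \frac{s+1}{2}\right]^2 + \sum_{i=1}^{N_s} \frac{1}{N_s}\left[i - \frac{N_s + 1}{2}\right]^2 $$    (4-124)

Carrying out the summations yields:

$$ \sigma^2 = \frac{s^2 - 1}{12} + \frac{N_s^2 - 1}{12} $$    (4-125)

$$ \sigma^2_{(4-75A)} = \frac{1}{12}\left[s^2 + (N/s)^2 - 2\right] $$    (4-126)

[Note: the subscript such as (4-75A) references the equation related to
a given variance.]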
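
Equation 4-125 can be confirmed by enumerating every outcome of a search in a small single-level file. The sketch assumes, as the report's model does, that the heading position and the item position are independent and uniformly distributed; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def variance_by_enumeration(s, n_s):
    """Variance of (heading position + item position) over all s * n_s outcomes."""
    outcomes = [i + j for i, j in product(range(1, s + 1), range(1, n_s + 1))]
    mean = Fraction(sum(outcomes), len(outcomes))
    return sum((Fraction(x) - mean) ** 2 for x in outcomes) / len(outcomes)

def variance_closed_form(s, n_s):
    """Equation 4-125: (s**2 - 1)/12 + (N_s**2 - 1)/12."""
    return Fraction(s * s - 1, 12) + Fraction(n_s * n_s - 1, 12)
```

For s = 4 headings with N_s = 6 items each, both routes give 25/6.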


By differentiating Equation 4-126 with respect to s, setting
the result equal to zero, and checking the appropriate requirements, it
can be shown that:

$$ s = \sqrt{N} $$    (4-127)

gives the minimum variance. Thus the s that gives minimum E, Equations
4-105 and 4-106, also gives the minimum variance. This value is:

$$ \sigma^2_{(4-75A)opt} = \frac{N - 1}{6} $$    (4-128)

*In this case Equation 4-83 will be used instead of Equation 4-83A. Equation
4-83A is not sufficiently accurate to be used in computing the variances,
because the variances are small. The computation is based upon
differences between numbers that are approximately equal.


For the indexed tree of items, the variance is:

$$ \sigma^2 = \frac{1}{N}\sum_{j=1}^{n} k^{j-1}\left[j - E_{(4-83)}\right]^2 $$    (4-129)

where n is given by Equation 4-82. An elementary theorem of mathematical
statistics states that Equation 4-129 is equal to:

$$ \sigma^2 = \frac{1}{N}\sum_{j=1}^{n} j^2 k^{j-1} - E^2 $$    (4-130)

where E is the expected value obtained from Equation 4-83. The sum in
Equation 4-130 can be evaluated by using some relationships among the
derivatives of arithmetic and geometric series. Generating functions
can also be employed directly and effectively, in this case, to obtain
the variance. Using either of these methods, the following expression
for the variance can be derived:

$$ \sigma^2_{(4-83)} = \frac{1}{k-1}\left[\frac{L_N^2}{N} - \frac{2L_N}{(k-1)N} - 2L_N + \frac{k+1}{k-1}\right] + L_N^2 - E^2 $$    (4-131)

where n = L_N is obtained from Equation 4-82 and E from Equation 4-83.
Equation 4-131 can be used to compute the variance for relatively small
size files (moderately large N).

As N becomes arbitrarily large, however, Equation 4-131 approaches
the following limiting value:

$$ \sigma^2_{(4-83)} = \frac{k}{(k-1)^2} $$    (4-132)

Equation 4-131 converges relatively rapidly to Equation 4-132. For
example, when k = 10, the following errors in the variance are introduced
by using Equation 4-132 rather than 4-131:

    N          Error in Equation 4-132
    10^…       1.11%
    10^…       0.70%
    10^…       0.05%

This point is primarily of academic interest, since the variances given
by Equations 4-131 and 4-132 are insignificant. For k ≥ 3, the variance
given by Equation 4-131 is less than 1. It can be shown that the variance
is a monotonically increasing function of N, and that Equation 4-132 is
an upper limit for the variance.
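
The behaviour of the indexed-tree variance can be checked with exact rational arithmetic. The Python sketch below computes the variance directly from Equations 4-129 and 4-130 for a full tree, evaluates a closed form obtained by carrying out the sum in Equation 4-130, and compares both with the limit k/(k-1)^2 of Equation 4-132. It is an assumption of this sketch that the closed form used here matches the report's Equation 4-131 term for term; the function names are illustrative.

```python
from fractions import Fraction

def tree_variance_direct(k, levels):
    """Equations 4-129/4-130: level j occurs with probability k**(j-1)/N."""
    n_items = (k**levels - 1) // (k - 1)
    mean = Fraction(sum(j * k**(j - 1) for j in range(1, levels + 1)), n_items)
    second = Fraction(sum(j * j * k**(j - 1) for j in range(1, levels + 1)), n_items)
    return second - mean**2

def tree_variance_closed(k, levels):
    """Closed form of the sum in Equation 4-130 (assumed form of 4-131)."""
    n_items = (k**levels - 1) // (k - 1)
    mean = (1 + Fraction(1, (k - 1) * n_items)) * levels - Fraction(1, k - 1)
    bracket = (Fraction(levels**2, n_items)
               - Fraction(2 * levels, (k - 1) * n_items)
               - 2 * levels
               + Fraction(k + 1, k - 1))
    return bracket / (k - 1) + levels**2 - mean**2

def tree_variance_limit(k):
    """Equation 4-132: limiting value k/(k-1)**2."""
    return Fraction(k, (k - 1) ** 2)
```

For k = 10 the variance rises with each added level but stays below 10/81, consistent with the statement that Equation 4-132 is an upper limit.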


Applying similar methods, the variances for the other file
structures were derived. They are:

$$ \sigma^2_{(4-96A)} = \frac{(k^2 - 1)(L_s - 1)}{12} + \frac{N_s^2 - 1}{12} $$    (4-133)

where L_s is obtained from Equation 4-87; N_s, from Equation 4-94.

$$ \sigma^2_{(4-100A)} = \frac{n^2 - 1}{12} $$    (4-134)

where n is obtained from Equation 4-99.

$$ \sigma^2_{(4-102A)} = \frac{N_s^2 - 1}{12} $$    (4-135)

where N_s is obtained from Equation 4-94.


The variances of Equations 4-96A and 4-102A can now be derived
for optimum s. From Equations 4-87 and 4-108:

$$ L_s = \log_k\left[\frac{kN}{(k+1)\log_k e}\right] $$    (4-136)

$$ L_s = 1 + \log_k\left[\frac{N}{(k+1)\log_k e}\right] $$    (4-137)

Substituting Equations 4-123A and 4-137 into Equation 4-133 yields:

$$ \sigma^2_{(4-96A)opt} = \frac{1}{12}\left\{(k^2 - 1)\log_k\left[\frac{N}{(k+1)\log_k e}\right] + (k+1)^2(\log_k e)^2 - 1\right\} $$    (4-138)

In the case of Equation 4-102A, substituting Equation 4-123B into Equation
4-135 gives:

$$ \sigma^2_{(4-102A)opt} = \frac{4(\log_k e)^2 - 1}{12} $$    (4-139)

Whenever the optimum N_s given by Equation 4-123B is less than 1, N_s is
taken as 1 and the variance given by Equation 4-139 is zero. The reason
is, of course, that in this case there is a unique indexed procedure to
locate any item in a fixed number of steps.
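
The clamping rule for N_s can be made concrete with a small Python sketch. It evaluates Equation 4-123B, replacing any value below 1 by 1, and then applies Equation 4-135; the function names are hypothetical.

```python
import math

def optimum_items_per_heading(k):
    """Equation 4-123B: N_s = 2*log_k(e), taken as 1 whenever that is below 1."""
    return max(1.0, 2 / math.log(k))

def optimum_variance_102a(k):
    """Equations 4-135 and 4-139: (N_s**2 - 1)/12 at the optimum N_s."""
    n_s = optimum_items_per_heading(k)
    return (n_s**2 - 1) / 12
```

The switch occurs between k = 7, where 2 log_k e is still slightly above 1, and k = 8, where it falls below 1, so for k = 10 the variance is exactly zero.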


The standard deviations from the expected values are shown in
Figure 3. In other words, Figure 3 is a graph of σ_(4-75A)opt, σ_(4-83),
σ_(4-100A), and σ_(4-96A)opt obtained by taking the positive square root
of Equations 4-128, 4-131, 4-134, and 4-138, respectively. The graph
was plotted for k = 10. For this value of k, the standard deviation of
the indexed tree of headings with sequential items is zero for the reason
given after Equation 4-139. Consequently, this standard deviation has
not been included in the graph. As Figure 3 indicates, the standard
deviation of the indexed tree of items, Equation 4-131, is also negligible.
Hence, the expected value is a good indicator of the actual number of headings
and items examined in a single search of an indexed tree. The standard
deviation for the non-indexed tree of headings, Equation 4-138, is
somewhat larger; for the non-indexed tree of items, Equation 4-134, it is
still larger. For reasonably large files, the largest deviation is that of the
single level subject heading file, Equation 4-128. Consequently, the
expected number of headings and items examined is not a good indicator of
what will occur in any given search of a single level file. This point is
verified by anyone's experience with this kind of file.

FIGURE 3. Standard Deviation From Average Number of Headings and Items
Examined in a Search

Figure 4 compares the cumulative probability distributions for
three types of files. It indicates rather clearly the wide variation in
n  among  the  file  types  (with  a  fixed  file  size)  for  any  given  probability 
that  the  number  of  headings  and  items  searched  will  be  not  greater  than  n 
in  any  single  search.  For  example,  in  a  file  of  111,111  items  the  proba¬ 
bility  is  .5  that  fewer  than  7  items  will  be  examined  in  an  indexed  tree; 
fewer  than  2$  in  a  non-indexed  tree;  but  fewer  than  335  in  a  sequential, 
single  level  heading  file. 

4.4.1.6 Generalized Expressions for Expected Values - The purpose
of this section is to present generalized expressions for the expected number
of headings and items searched, when two previous assumptions are
removed. These assumptions are:

(a) Each subject heading or item is equally likely to be the one
sought.

(b) The same number of items is filed under each heading.


FIGURE 4. Cumulative Probability Distributions for a Search of Differently
Organized Files


For  example,  if  information  is  available  on  anticipated  or  past  activity 
of  the  file  items — and  if  this  information  indicates  the  likelihood  of  a 
given  heading  or  item  being  requested — then  the  expected  number  of  headings 
and  items  searched  can  be  obtained  in  terms  of  the  available  data  that 
approximate  the  probability  distribution  of  file  activity.  Generally,  the 
more  specialized  the  contents  of  a  file,  the  better  known  and  more  stable 
will  be  its  activity.  When  the  activity  of  the  file  is  known  and  it  is 
relatively  stable,  it  is  clearly  advantageous  to  organize  the  file  so  that 
the  items  that  have  the  greatest  likelihood  of  being  requested  are  the  most 
accessible.  For  obvious  reasons  such  a  file  is  called  activity  organized. 
It  is  the  intent  of  this  section  to  provide  a  general  background  for  the 
investigation of activity organized files in terms similar to those
appearing in previous sections. For the sake of simplicity, expressions for
expected values will be presented for only two of the file organizations.
These expressions will provide a starting point for the analysis of
activity organized files. In each case, p(i) indicates the probability that
the i-th item or heading is the answer to a request.


The expected value for single level subject headings with sequential
items, Equation 4-75, generalizes to:

$$E = \sum_{i=1}^{s} i\,p_s(i) + \sum_{i=1}^{s} \left[ \sum_{j=1}^{n_i} j\,p_i(j) \right] p_s(i) \qquad \text{(4-140)}$$

where $s$ = the number of subject headings in the file.

$n_i$ = the number of items under heading i.

$p_s(i)$ = the probability that the answer to a request is under
heading i.

$p(j)$ = the probability that item j is the answer to a request.

$p_i(j)$ = the probability that item j will be requested, given
that it is filed under heading i.

This last probability is obtained from:

$$p(j) = p_s(i) \cdot p_i(j) \qquad \text{(4-141)}$$
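
Equation 4-140 can be evaluated directly once the two probability tables are available. The following is a minimal sketch, not part of the original study: the function name `expected_examined` and the sample probabilities are illustrative assumptions, and headings and items are assumed to be scanned in the order in which they are listed.

```python
from typing import Sequence


def expected_examined(p_s: Sequence[float],
                      p_given: Sequence[Sequence[float]]) -> float:
    """Expected number of headings and items examined (Equation 4-140).

    p_s[i]        -- probability that the answer is under heading i+1
    p_given[i][j] -- probability that item j+1 is requested, given that
                     it is filed under heading i+1
    Headings and items are assumed to be scanned in list order.
    """
    # First term: expected number of headings examined.
    headings = sum((i + 1) * p for i, p in enumerate(p_s))
    # Second term: expected number of items examined under the heading found.
    items = sum(
        p * sum((j + 1) * q for j, q in enumerate(row))
        for p, row in zip(p_s, p_given)
    )
    return headings + items


# Equally likely case: 4 headings with 3 items each.
uniform = expected_examined([0.25] * 4, [[1 / 3] * 3] * 4)

# Skewed activity (hypothetical figures), busiest heading last, then first.
skewed_last = expected_examined([0.1, 0.1, 0.1, 0.7], [[1 / 3] * 3] * 4)
skewed_first = expected_examined([0.7, 0.1, 0.1, 0.1], [[1 / 3] * 3] * 4)
```

With these sample figures, placing the most active heading first gives a lower expected value than the equally likely case, and placing it last gives a higher one, which is the motivation for activity organization.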

The expected value for the indexed tree of items, Equation 4-83,
generalizes to:

$$E = \sum_{j=1}^{n} j\,p(j) \qquad \text{(4-142)}$$

where $p(j)$ is the probability of finding the answer on the j-th cut
(level j of the tree); it is given by:

$$p(j) = \sum_{i=1}^{k^{j-1}} p_j(i) \qquad \text{(4-143)}$$

where $p_j(i)$ is the probability that the i-th node on level j is the
requested item. Values for n are obtained from Equation 4-82.
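
Equations 4-142 and 4-143 can be checked numerically on a small tree. The sketch below is illustrative only: the function name `expected_levels`, the branching factor k = 2, the depth n = 3, and the node probabilities are all assumed values, not data from the report.

```python
from typing import Sequence


def expected_levels(node_probs: Sequence[Sequence[float]]) -> float:
    """Expected number of items examined in an indexed tree.

    node_probs[j-1][i-1] is p_j(i), the probability that the i-th node
    on level j is the requested item. Equation 4-143 sums each level to
    get p(j); Equation 4-142 then weights level j by p(j).
    """
    level_probs = [sum(level) for level in node_probs]       # p(j), Eq. 4-143
    return sum(j * p for j, p in enumerate(level_probs, start=1))  # Eq. 4-142


# Hypothetical binary tree (k = 2) with n = 3 levels and 7 equally likely nodes.
k, n = 2, 3
total = sum(k ** (j - 1) for j in range(1, n + 1))            # 1 + 2 + 4 = 7
uniform_tree = [[1 / total] * k ** (j - 1) for j in range(1, n + 1)]
e_uniform = expected_levels(uniform_tree)

# Same tree shape, but the root is requested half the time.
active_root = [[0.5], [0.125, 0.125], [0.0625] * 4]
e_active = expected_levels(active_root)
```

In this example the equally likely tree gives 17/7 items examined on average, while concentrating activity near the root lowers the figure to 1.75.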

4.4.1.7 Summary - Conclusions have been developed and presented
throughout this section and will be summarized only briefly. These
conclusions are valid only for files where every heading and item is equally
likely to be required for a response.

(a) In terms of expected values, indexed trees give a lower average
number of headings and items examined than non-indexed trees.
Non-indexed trees give lower values than single level subject
headings, except for small files. The break-even points can be
determined precisely from the equations in Section 4.4.1.4.

(b) Whenever a file of items can be indexed or ordered into a tree
structure, it is disadvantageous, in terms of expected values,
to superimpose any heading structure on the items.

(c) For trees and single level subject heading files, there are
relationships between the number of headings and the number of
items in the file that minimize the expected number of headings
and items that will be examined in a file search.

(d) The standard deviation from the average number of headings and
items examined for indexed trees is small. Consequently, these
average numbers are excellent indicators of the number of headings
and items likely to be examined in a single search. The
deviations for non-indexed trees are somewhat larger, so expected
values are of less utility. Finally, the deviation from the
expected values of the file with single level headings and
sequential items is so large that the average values are poor
indicators of the number of headings and items examined in any
single search.

This  study  can  be  extended  in  any  one  of  several  directions.  The 
choice  should  be  made  on  the  basis  of  how  well  the  work  can  be  integrated 
with  other  research  tasks  in  this  project.  The  utility  anticipated  from 
these extensions should also be considered. Some general areas for
possible further investigation are:

(a)  Extend  the  study  to  obtain  required  search  times— i.e.,  mean 
recurrent  events,  Reference  [£] — after  taking  into  account  the 
time required for indexing and other processing functions
necessary for retrieval.


(b) Analyze other file organizations. Activity organized files
should be investigated for several widely differing distributions
to ascertain their advantage in terms of quantitative
statistics. Files consisting of many related or unrelated
trees and non-regular trees should also be considered.

(c) Consider models of file organization other than tree structures
(e.g., Markov chains) for the representation of the relationship
between file organization and search.

4.5 INTEGRATIVE CAPABILITIES

The work on non-Boolean retrieval and on the comparative analysis of
file organizations both have implications for integrative system models.
To date, however, no explicit formulation of such a model has been
attempted. Preliminary theoretical speculation continually takes
place. One area in which there has been an attempt to document such
speculation concerns the relationship between frequency and indexing.

4.5.1 General Theoretical Considerations with Special Reference to
the Relationship Between Frequency and Indexing - In a collection of n
items, there is only a finite number of subcollections of items that are
theoretically possible responses in item retrieval systems. The number
is $2^n$ if zero items are considered a subcollection. In practice, not all
$2^n$ answers are equally likely to be searched for by a user. Intuition
suggests that this disparity is an essential criterion for the effective
design of a query or descriptor language.

There are several possible approaches to specifying which of these $2^n$
subcollections is being referenced. In one sense the simplest means of
specification is to assign a name or descriptor to each of the n items in
the collection. In the case when all $2^n$ subcollections are requested
equally often and when the questioner knows the name of each item he is
interested in, this method produces an adequate system. If, however, some
subcollections are considerably more popular than others, then an obvious
improvement in coding efficiency would result from giving popular
collections special category names.
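
The gain from naming popular subcollections can be illustrated with a small calculation. This is a sketch under stated assumptions, not a measure taken from the report: coding cost is taken to be the number of names a questioner must supply, a named category costs one name, and the function name `expected_names_per_request` and the demand figures are hypothetical.

```python
from typing import Dict, FrozenSet, Set


def expected_names_per_request(
    demand: Dict[FrozenSet[str], float],
    categories: Set[FrozenSet[str]],
) -> float:
    """Average number of names a questioner must supply per request.

    demand     -- relative frequency of each requested subcollection
    categories -- subcollections that have been given a single category name
    A named category costs one name; any other subcollection is assumed
    to be requested by listing each of its items by name.
    """
    total = sum(demand.values())
    cost = 0.0
    for subcollection, freq in demand.items():
        names = 1 if subcollection in categories else len(subcollection)
        cost += freq * names
    return cost / total


n = 4
possible_answers = 2 ** n          # 16 subcollections, counting the empty one

demand = {
    frozenset({"a", "b", "c"}): 0.6,   # a popular subcollection
    frozenset({"a"}): 0.2,
    frozenset({"b", "d"}): 0.2,
}
by_item_names = expected_names_per_request(demand, set())
with_category = expected_names_per_request(demand, {frozenset({"a", "b", "c"})})
```

Under these assumed figures, a single category name for the popular subcollection halves the average number of names per request, from 2.4 to 1.2.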

There  are,  however,  other  considerations  than  information  theoretic 
measures  of  coding  efficiency  that  are  relevant  to  the  selection  of  a 
descriptor  language.  Asking  for  all  the  items  in  a  subcollection  by  name 
is  possible  only  when  the  names  of  all  the  documents  in  the  subcollection 
that  are  of  interest  are  known.  Under  these  circumstances  the  general 
problem of information retrieval becomes a special case, and only
considerations of coding efficiency and, perhaps, user compatibility are relevant
criteria  for  descriptor  language  design. 

In  an  ordinary  library  search  the  questioner  does  not  know  the  names 
of  the  items  he  needs.  He  wants  the  system  to  supply  a  subcollection  of 
items  that  will  provide  information  relevant  to  his  query  after  he  reads 
them.  The  system  must  go  from  his  query  or  a  transformation  of  his  query 
to  an  appropriate  subcollection  of  items,  even  though  the  user  does  not 
yet  know  in  advance  what  is  in  this  subcollection. 

How can the system do this? One approach is to ask, perhaps implicitly,
questions in advance and to search, again implicitly, the entire collection
to find the items that contain information relevant to each question. The
system would then have the stored answer available whenever the same
question arose. In a sizable collection it is not feasible to ask all
questions in advance. There are two reasons. First, there are a large number
of ways of asking essentially the same question; another way of putting
this point is that the same answer subcollection would satisfy many
possible question variations. Second, there are too many possible answers
(specifically, $2^n$) in any sizable system.

Each  of  these  difficulties  requires  a  different  approach.  The  approach 
to  the  former  involves  standardization;  that  is,  the  possible  ways  of 
asking  essentially  the  same  question  must  be  restricted.  This  solution  is 
essentially a language problem. The approach to the latter difficulty
involves  exclusion  of  less  probable  questions  and  their  resultant  answers 
from  advance  treatment.  This  solution  is  essentially  a  system  design  and 
organization  problem. 
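
The two approaches can be sketched together: question variants are reduced to one standard form, and only questions treated in advance have stored answers. Everything named here is a hypothetical illustration (the `SYNONYMS` table, the functions `canonical_question`, `store`, and `lookup`, and the document identifiers); the report does not prescribe a particular standardization method.

```python
from typing import Dict, FrozenSet, List

# Hypothetical synonym table: each variant term maps to one standard term.
SYNONYMS: Dict[str, str] = {"retrieval": "search", "lookup": "search",
                            "documents": "document", "papers": "document"}


def canonical_question(text: str) -> FrozenSet[str]:
    """Reduce a question to an order-free set of standard terms."""
    words = [w.strip("?.,").lower() for w in text.split()]
    return frozenset(SYNONYMS.get(w, w) for w in words if w)


stored_answers: Dict[FrozenSet[str], List[str]] = {}


def store(question: str, documents: List[str]) -> None:
    """Record the answer list for a question treated in advance."""
    stored_answers[canonical_question(question)] = documents


def lookup(question: str) -> List[str]:
    """Return the stored answer, or an empty list if none was prepared."""
    return stored_answers.get(canonical_question(question), [])


store("document retrieval", ["D1", "D7"])
same = lookup("Search documents?")     # a variant of the stored question
missing = lookup("magnetic tape")      # never treated in advance
```

The variant wording resolves to the stored answer, while the question that was excluded from advance treatment returns nothing and would require a search of the collection.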

How is explicit or implicit advance treatment of questions possible?
One method would be to have all documents in the library unordered, except
perhaps by author and title for those searches in which the questioner already
knows which documents he wants. Anyone wishing to use the library could
then be asked to submit both a copy of his question and a list of the
documents he found relevant after making his search of the library. This
information could then be stored for occasions when the same or similar
questions are asked.

Of course, this scheme is impractical. Listing some of its inherent
difficulties may lead to an understanding of the requirements of an ideal
descriptor-query language.

(a)  There  is  no  assurance  that  any  initial  questioner  will  do  a  good 
or  thorough  job  in  searching  all  the  documents  in  the  library. 

(b)  Even  if  the  initial  questioner  has  done  a  perfect  job  at  the  time 
he  searched  the  library,  there  would  be  a  lack  of  information 
about  the  relevance  of  new  accessions  to  the  question.  Of  course, 
new  accessions  could  be  re-searched  by  subsequent  questioners  in 
order  to  keep  the  answer  list  up  to  date. 

(c) Many questions will recur imprecisely; and even if the statement
of the question is identical, different users are likely to have
different meanings or intentions that would influence which
documents they considered appropriate for the answer list. Thus,
even if there is a perfect and up-to-date search performed by the
initial questioner, it is not likely to be perfect for a
subsequent questioner.

(d) Such a system would impose an unacceptable search burden not only
upon initial questioners but also upon subsequent questioners, if
there are a substantial number of new acquisitions. Furthermore,
the askers of somewhat unusual questions would always tend to be
in the role of initial questioners, regardless of how long the
system has been in operation. Their extensive search efforts
would rarely be applied by subsequent users.

The technique currently used by most libraries, in order to deal with
these objections, is implicitly to select a range of questions to be
pre-answered and then to assess the relevance of each accession (i.e., index
it) to all these questions as it is entered into the library file. To the
extent that a document's relevance to many questions can be assessed nearly
simultaneously, this technique has obvious advantages over repeatedly
scanning each document for each question in some sequence of questions.

The  approach  of  classifying  each  accession  for  all  questions  will  deal 
completely  only  with  difficulties  (a)  and  (b).  Difficulties  (c)  and  (d) 
will  be  resolved  only  to  the  extent  that  the  question  list,  against  which 
each  document  is  implicitly  being  checked,  is  sufficiently  extensive  and 
to  the  extent  that  the  meaning  of  these  implicit  questions  is  sufficiently 
clear  to  the  system  users. 
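
Indexing at accession can be sketched as follows. The question list, the keyword tests standing in for an indexer's judgment, and the names `QUESTIONS`, `accession`, and `answer` are all assumptions made for illustration; a real index would rest on human or automatic classification rather than substring tests.

```python
from typing import Callable, Dict, List

# Hypothetical pre-selected questions, each expressed as a test on a
# document's text. Real index terms would be assigned by an indexer.
QUESTIONS: Dict[str, Callable[[str], bool]] = {
    "file organization": lambda text: "file" in text and "tree" in text,
    "automatic indexing": lambda text: "indexing" in text,
}

index: Dict[str, List[str]] = {q: [] for q in QUESTIONS}


def accession(doc_id: str, text: str) -> None:
    """Assess a new document against every question, once, on entry."""
    lowered = text.lower()
    for question, is_relevant in QUESTIONS.items():
        if is_relevant(lowered):
            index[question].append(doc_id)


def answer(question: str) -> List[str]:
    """Return the pre-answered list; no document is re-scanned."""
    return index.get(question, [])


accession("D1", "Search of an indexed tree file")
accession("D2", "Word frequency in automatic indexing")
accession("D3", "Tree files and indexing costs")
```

Each document is examined once, when it enters the file, and a later call such as `answer("file organization")` is a lookup. A question outside the pre-selected list, such as `answer("Markov chains")`, returns nothing, which is the limitation the surrounding text goes on to discuss.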

It is likely that none of the difficulties will ever be resolved
completely. Even a user searching on the basis of his own question is likely
to  introduce  inadvertent  errors  of  both  inclusion  and  exclusion  on  the 
answer  list  if  he  is  scanning  a  large  file  collection.  Similar  errors  will 
occur  when  a  librarian  classifies  a  book.  But  additional  errors  will 
result  from  the  fact  that  the  meaning  of  the  implicit  questions  reflected 
by  the  classification  varies  from  person  to  person. 

These errors, while often significant, are not as basic a problem as
the limitation on possible questions that can be answered. These
limitations are a necessary concomitant of indexing a large collection. As has
already been suggested, there are two kinds of limitations:

(a) Basic limitations on the retrieval of all $2^n$ answers. In general,
no indexing scheme for a sizable collection is sufficiently
articulated to allow retrieval of all possible answers without knowing
the names of individual documents.

(b)  Secondary  limitations  on  the  acceptability,  or  communicability, 
of  a  specific  question  formulation  that  does  in  fact  correspond 
to  one  of  the  accessible  answers. 

The latter limitation does not necessarily imply any change in the
logical organization of the indexing or query-descriptor language. The
problem is one of using appropriate names or labels for the index terms
or combinations of index terms that correspond to those of the $2^n$ answers
that the system is capable of generating. Of course, the problem is not
one that can be solved merely by the judicious selection of terms. It is
necessary that the questioner and the library system use these terms in
essentially the same sense. Furthermore, it is necessary that alternate
descriptions of the same answer or question be interconvertible, either
by the library system or by the user. To date, the only methods of
dealing with this problem have been to provide the user with a dictionary-type
description of the index terms, an over-view of the relationship among the
terms used by the system, and/or a thesaurus type of referral ("see" and
"see also") to related terms.

The  problem  of  converting  synonymous  descriptions  probably  cannot  be 
approached  by  considering  the  relative  frequency  of  subcollection  questions. 
Of  course,  the  more  popular  a  subcollection,  the  more  valuable  it  might  be 
to  be  able  to  deal  with  alternate  ways  of  describing  it.  The  problem  of 
unaskable  questions,  however,  can  only  be  approached  fruitfully  from  this 
point of view. If the system is to be insufficiently articulated for the
retrieval of all $2^n$ possible answer collections, it seems that the criteria
(other than random exclusion based upon cost considerations) for deciding
which subcollections are to be retrievable should ultimately be based upon
the frequency of user demand. Only those questions that will rarely or
never be asked should in principle be unanswerable (without searching the
entire collection) because of limitations in the query language and the
accompanying file structures and search procedures.

This conclusion suggests that a second consideration, besides the
relative frequency of user demand for various possible answers, may be
important. This consideration is the absolute level of demand for a possible
answer subcollection. The absolute level of demand is readily calculated
from estimates of relative demand and the total number of questions asked.
An estimate for the number of questions may be the length of time for which
the collection of items will be used multiplied by levels of use such as
questions per day during this interval. As absolute use of the system as
a whole increases, more articulate indexing becomes necessary to include
the relatively less frequently asked questions, which now are asked a
significant number of times in the system's lifetime.
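
The calculation described above is a simple product. The sketch below uses hypothetical figures (a subcollection wanted once in 10,000 questions, a lifetime of 1,000 days, and two levels of use); the function names are illustrative and are not taken from the report.

```python
def total_questions(lifetime_days: float, questions_per_day: float) -> float:
    """Estimated number of questions over the life of the collection."""
    return lifetime_days * questions_per_day


def absolute_demand(relative_demand: float, lifetime_days: float,
                    questions_per_day: float) -> float:
    """Expected number of times one answer subcollection is requested."""
    return relative_demand * total_questions(lifetime_days, questions_per_day)


# Hypothetical figures: a subcollection wanted once in 10,000 questions.
light_use = absolute_demand(1e-4, lifetime_days=1000, questions_per_day=5)
heavy_use = absolute_demand(1e-4, lifetime_days=1000, questions_per_day=500)
```

At 5 questions per day the subcollection is expected to be requested less than once in the system's lifetime; at 500 per day it is expected about 50 times, which may justify making it directly retrievable.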

Answer subcollections should not merely be regarded as accessible or
inaccessible with a given query capability. Even if a subcollection is
not immediately accessible, there are degrees of desirability that can be
discriminated with regard to its inaccessibility. Thus a desired answer
subcollection may not be directly accessible per se, yet it may be wholly
embedded in another subcollection that is accessible and that contains few
additional items. Clearly, there is no great deficiency in query capability
under such circumstances so long as the user can identify and ask for the
appropriate inexact subcollection. If, however, the items in a desired
inaccessible subcollection are widely scattered (that is, the items cannot
be obtained without searching a number of accessible subcollections), the
situation is quite different. This difficulty is likely to be further
complicated by the inherent unavailability of information about which
accessible subcollections contain the items the user needs. Under such
circumstances the user may be reduced to searching the entire collection,
or unacceptably large parts of it, in order to obtain the needed
information. It might be fruitful to develop rigorous measures of degree of
inaccessibility based upon minimal and/or maximal false drops and/or
misses.
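
The embedded and scattered cases can be distinguished with a simple count of false drops and misses. The report only proposes that such measures be developed, so the sketch below is one possible reading: `best_embedding` and `greedy_cover` are hypothetical names, and the greedy covering rule is an assumed heuristic, not a method from the report.

```python
from typing import FrozenSet, Iterable, Optional, Set, Tuple


def best_embedding(desired: Set[str],
                   accessible: Iterable[FrozenSet[str]]
                   ) -> Optional[Tuple[FrozenSet[str], int]]:
    """Smallest accessible subcollection that wholly contains `desired`.

    Returns the subcollection and its number of false drops (extra
    items), or None when no single accessible subcollection contains
    every desired item.
    """
    candidates = [a for a in accessible if desired <= a]
    if not candidates:
        return None
    best = min(candidates, key=len)
    return best, len(best - desired)


def greedy_cover(desired: Set[str],
                 accessible: Iterable[FrozenSet[str]]) -> Tuple[int, int, int]:
    """Greedy estimate for a scattered subcollection.

    Returns (subcollections searched, false drops, misses).
    """
    pool = list(accessible)
    remaining, retrieved, searched = set(desired), set(), 0
    while remaining:
        pick = max(pool, key=lambda a: len(a & remaining), default=None)
        if pick is None or not (pick & remaining):
            break                      # the rest cannot be reached at all
        retrieved |= pick
        remaining -= pick
        searched += 1
    return searched, len(retrieved - desired), len(remaining)


accessible = [frozenset("abc"), frozenset("abcd"), frozenset("de"), frozenset("fg")]
embedded = best_embedding({"a", "b"}, accessible)
scattered = greedy_cover({"a", "e", "z"}, accessible)
```

In this example the desired pair is embedded in an accessible subcollection at the cost of one false drop, while the scattered request needs two searches, yields three false drops, and still misses one item.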

Such a measure of accessibility could be used to evaluate the goodness
of any descriptor scheme for any item collection. More precisely, it could
be used to measure the average (in)accessibility for the power set of items,
the set of $2^n$ possible answers, for a given descriptor scheme. When
combined with information about relative frequencies of the members of the
possible answer set, such a measure can provide information about the
average accessibility of items per request. The main purpose of a general
theory of information retrieval is to provide an analytical framework in
which this quantity, the average accessibility per request, can be
optimized, given a context of relevant system parameters.
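
The quantity described above can be written compactly. The following is a sketch only: the symbols $I$ (the set of n items), $D$ (a descriptor scheme), $f(S)$ (the relative frequency with which subcollection $S$ is requested), and $a_D(S)$ (some accessibility score for $S$ under $D$) are notation assumed here for illustration and do not appear in the report.

```latex
\bar{A}(D) = \sum_{S \subseteq I} f(S)\, a_D(S),
\qquad \sum_{S \subseteq I} f(S) = 1,
\qquad \bigl|\{\, S : S \subseteq I \,\}\bigr| = 2^{n}
```

Under this reading, optimizing the average accessibility per request means choosing the descriptor scheme $D$ that maximizes $\bar{A}(D)$ for fixed values of the system parameters.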

Some  of  the  relevant  parameters  that  such  a  model  should  ultimately 
encompass  are: 

(a) Number of items in the system.

(b) Number of descriptors.

(c)  Articulation  of  descriptor  scheme. 

(d)  Cost  per  descriptor  assignment. 

(e)  Cost  per  false  drop. 

(f)  Cost  per  miss. 

(g)  Cost  per  search  unit. 

(h)  Cost  per  file  unit. 

(i)  Number  of  queries. 

The development of models for estimating the cost parameters is an
important problem on which further work is necessary (see Reference [6]). Such
quantities  can,  however,  be  treated  parametrically  in  a  general  information 
retrieval  model,  and  valuable  insight  into  the  design  of  optimal  descriptor 
schemes  may  thus  be  obtained. 

It may be objected that the basic datum of such a model (viz., estimates
of relative frequency of reference to members of the power set of
items) is virtually impossible to obtain in detail. It is undoubtedly
both impossible to get completely accurate estimates and impractical to
get even inaccurate estimates for each member of the power set of a
substantial number of items. This difficulty does not, however, preclude
parametric  treatment  of  the  distribution  of  relative  frequency  of  reference 
among  the  members  of  the  power  set. 

It is possible to estimate intuitively some of the consequences of the
relative frequency of reference distribution in the answer set. If all
answers are equally probable, there would seem to be no basis for choosing
among which answers should be accessible. Under these circumstances a
Uniterm type of descriptor system, in which there are no hierarchical
organizations, might be most efficient. If, on the other hand, it is clear
that many members of the answer set are answers in principle only and that
such a collection of items would rarely be called for, then a hierarchical
organization of the index may be appropriate. Similarly, when the cost of
false drops is relatively high then, for a given number of descriptors, the
average number of documents referenced per descriptor should be relatively
small. If the cost of misses is emphasized, however, then the average
number of documents per descriptor should be relatively large.


A rigorously formulated model would test and add quantitative depth to
such intuitive conclusions and would probably generate other unforeseen,
but perhaps more significant, relationships.

5. CONCLUSIONS

All areas of capability have been extended by analytical study of
aspects of the information retrieval problem that required fuller
definition and articulation. Input capabilities have been specifically
dealt with from the viewpoints of using word frequency as an indicator
in automatic indexing and of using a non-Boolean retrieval scheme.

Query capabilities were analyzed for the purposes of automatic
extracting and redundancy control. Organization and search schemes
were specified and their implications compared. A preliminary
consideration of the relation of frequency to the assessment of indexing and
thus to a model for system integration was presented.

Most of these contributions are still essentially in the analytical
and research stage. The only area that could currently proceed to
experimental implementation is the work on automatic extracting. Because of
the magnitude of such a task and its subordinate position in the project
as a whole, it is recommended that this work be continued as a separate
project.

6. PLANS FOR NEXT QUARTER

Activities  during  the  next  quarter  will  proceed  with  the  over-all 
goal  of  developing  a  theory  of  information  retrieval  for  use  as  a  tool 
in  the  design  of  information  retrieval  systems.  This  work  will  proceed 
within the specific task framework described in this report. The
general emphasis will continue to be analytical with the primary purpose
of  developing  methods  to  evaluate  the  relationship  among  significant 
system  parameters.  Toward  these  ends  work  on  literature  accession  and 
review  will  continue  to  be  a  significant  feature  of  the  next  quarter's 
activities. 

Under input capability work will continue on problems of automatic
classification with a view to generalizing to the case of non-exclusive
classes. Planned extensions of this work include work on the optimal
definition and location of class boundaries and on the evaluation of the
adequacy of prediction and classification schemes. Work on non-Boolean
retrieval will be continued. As already indicated, this work has
implications for more general areas of capability.

Under query capabilities no extension of presently reported work on
automatic extracting is planned. Further extensions on redundancy are,
however, being considered. The formulation of a general theory of
descriptor languages based upon frequency and accessibility will have
important implications for improving query capabilities.

Under processing capabilities it is planned to focus on the problem
of associative techniques, an area that has been relatively untouched
during this quarter. Some specific possibilities for extension on
organization and search have been enumerated. A special intensive
review of the applicability of the multi-list scheme of Prywes and
Gray is also planned for this area.

Under integrative capabilities it is planned to attempt more
rigorous formulations of the kind of system model alluded to in Section
4.5.1. Further documentation in the area of general theoretical
considerations may also be expected.

7.1 PERSONNEL ASSIGNMENTS

The following personnel were assigned to the project during the

Senior Specialist

Alfred Trachtenberg

Senior Program Analyst

450

Alexander Ssejman

Senior Specialist

7.2 BACKGROUND OF PERSONNEL

The  backgrounds  of  personnel  originally  assigned  to  the  project  were 
described  in  the  First  Quarterly  Report.  One  new  person  was  assigned  to 
the  project  this  quarter.  A  description  of  his  background  follows. 

7.2.1 Alexander Ssejman - BS, Physics, City College of New York,
1956; MA, Mathematical Economics, New York University, 1962; Graduate
work in Physics, New York University. Activities involve mathematical
analyses of adaptive and learning information systems. Previous
experience includes mathematical analysis of diverse engineering problems
and computer simulation.